Notebook for Preprocessing & Saving Preprocessed Data

Loads from the RAW layer:
- stock_prices.csv
- us_income_statements.csv

Merges them using SimFinId as the key and, after cleaning, saves the output as merged_stock_income.csv in the data/PREPROCESSING folder.

The data cleaning & preprocessing steps follow the merge step in this same notebook and operate on the merged DataFrame.

✔ What it does:
- Loads from the RAW layer stock_prices.csv and us_income_statements.csv, and merges them using SimFinId as the key.

Then, it cleans and prepares the data:
- Drops the "Dividend" column (mostly null).
- Converts "Date" to datetime.
- Fills nulls in "Shares Outstanding" with the latest reported value per ticker, using group-wise .last() and .map().
- Drops columns and rows above a missing-value threshold, forward/backward fills income data within each company, and fills categorical columns with the mode.
- Applies linear interpolation for remaining NaNs.
- Saves the cleaned version in the data/PREPROCESSING folder.

In [1]:
import pandas as pd
import os

# Load the stock prices dataset
df_prices = pd.read_csv("data/RAW/stock_prices.csv") # ADAPT TO YOUR LOCAL ENVIRONMENT


# Print column names and dtypes to verify the load
df_prices.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5871346 entries, 0 to 5871345
Data columns (total 11 columns):
 #   Column              Dtype  
---  ------              -----  
 0   Ticker              object 
 1   Date                object 
 2   SimFinId            int64  
 3   Open                float64
 4   High                float64
 5   Low                 float64
 6   Close               float64
 7   Adj. Close          float64
 8   Volume              int64  
 9   Dividend            float64
 10  Shares Outstanding  float64
dtypes: float64(7), int64(2), object(2)
memory usage: 492.7+ MB


In [2]:
df_prices[df_prices['Ticker']=='PG']

Unnamed: 0,Ticker,Date,SimFinId,Open,High,Low,Close,Adj. Close,Volume,Dividend,Shares Outstanding
4074419,PG,2019-04-26,133209,103.95,105.88,103.91,105.86,91.79,7747682,,2.508330e+09
4074420,PG,2019-04-29,133209,105.57,105.75,104.72,104.78,90.86,4787561,,2.508330e+09
4074421,PG,2019-04-30,133209,104.86,106.62,104.66,106.48,92.33,8251321,,2.508330e+09
4074422,PG,2019-05-01,133209,106.15,106.39,104.81,104.93,90.99,6734075,,2.508330e+09
4074423,PG,2019-05-02,133209,105.20,105.76,105.05,105.56,91.53,6270323,,2.508330e+09
...,...,...,...,...,...,...,...,...,...,...,...
4075654,PG,2024-03-22,133209,162.20,162.41,161.47,161.66,157.77,6393425,,2.353021e+09
4075655,PG,2024-03-25,133209,161.17,161.66,159.73,160.19,156.34,7145692,,2.353021e+09
4075656,PG,2024-03-26,133209,160.36,161.14,160.14,160.55,156.69,5842850,,2.353021e+09
4075657,PG,2024-03-27,133209,161.36,162.74,161.34,162.61,158.70,6599711,,2.353021e+09


In [3]:
# Load the income statements dataset
df_income = pd.read_csv("data/RAW/us_income_statements.csv")

df_income.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17555 entries, 0 to 17554
Data columns (total 26 columns):
 #   Column                                    Non-Null Count  Dtype  
---  ------                                    --------------  -----  
 0   SimFinId                                  17555 non-null  int64  
 1   Currency                                  17555 non-null  object 
 2   Fiscal Year                               17555 non-null  int64  
 3   Fiscal Period                             17555 non-null  object 
 4   Publish Date                              17555 non-null  object 
 5   Restated Date                             17555 non-null  object 
 6   Shares (Basic)                            17403 non-null  float64
 7   Shares (Diluted)                          17276 non-null  float64
 8   Revenue                                   15745 non-null  float64
 9   Cost of Revenue                           13770 non-null  float64
 10  Gross Profit                      

In [4]:
df_income[df_income['SimFinId']==133209]

Unnamed: 0,SimFinId,Currency,Fiscal Year,Fiscal Period,Publish Date,Restated Date,Shares (Basic),Shares (Diluted),Revenue,Cost of Revenue,...,Non-Operating Income (Loss),"Interest Expense, Net","Pretax Income (Loss), Adj.",Abnormal Gains (Losses),Pretax Income (Loss),"Income Tax (Expense) Benefit, Net",Income (Loss) from Continuing Operations,Net Extraordinary Gains (Losses),Net Income,Net Income (Common)
12045,133209,USD,2019,FY,2019-08-06,2021-08-06,2503600000.0,2539500000.0,67684000000.0,-34768000000.0,...,582000000.0,-289000000.0,14414000000.0,-8345000000.0,6069000000.0,-2103000000.0,3966000000,,3897000000,3897000000
12046,133209,USD,2020,FY,2020-08-06,2022-08-05,2487100000.0,2625800000.0,70950000000.0,-35250000000.0,...,128000000.0,-310000000.0,15834000000.0,,15834000000.0,-2731000000.0,13103000000,,13027000000,13027000000
12047,133209,USD,2021,FY,2021-08-06,2023-08-04,2473700000.0,2610400000.0,76118000000.0,-37108000000.0,...,-371000000.0,-457000000.0,17615000000.0,,17615000000.0,-3263000000.0,14352000000,,14306000000,14306000000
12048,133209,USD,2022,FY,2022-08-05,2024-08-05,2514236000.0,2601000000.0,80187000000.0,-42157000000.0,...,182000000.0,-388000000.0,17995000000.0,,17995000000.0,-3202000000.0,14793000000,,14742000000,14742000000
12049,133209,USD,2023,FY,2023-08-04,2024-08-05,2414003000.0,2483900000.0,82006000000.0,-42760000000.0,...,219000000.0,-449000000.0,18353000000.0,,18353000000.0,-3615000000.0,14738000000,,14653000000,14653000000


In [5]:
# df_merged does not exist yet at this point, so compare the keys of the two source DataFrames
unique_simfinid_left = df_prices['SimFinId'].unique()
unique_simfinid_right = df_income['SimFinId'].unique()

print("Unique SimFinId in left DataFrame:", unique_simfinid_left)
print("Unique SimFinId in right DataFrame:", unique_simfinid_right)

In [6]:
# Step 1: Convert dates to datetime
df_prices['Date'] = pd.to_datetime(df_prices['Date'])
df_income['Publish Date'] = pd.to_datetime(df_income['Publish Date'])


# Step 2: Sort BOTH DataFrames by SimFinId and their respective date fields
df_prices_sorted = df_prices.sort_values(by=['SimFinId', 'Date']).reset_index(drop=True)
df_income_sorted = df_income.sort_values(by=['SimFinId', 'Publish Date']).reset_index(drop=True)

In [7]:
# Make sure both date columns are datetime type
df_prices_sorted['Date'] = pd.to_datetime(df_prices_sorted['Date'])
df_income_sorted['Publish Date'] = pd.to_datetime(df_income_sorted['Publish Date'])

# Sort both DataFrames by the merge keys
df_prices_sorted = df_prices_sorted.sort_values(by=['SimFinId', 'Date'])
df_income_sorted = df_income_sorted.sort_values(by=['SimFinId', 'Publish Date'])

# Perform the merge using groupby and apply
def merge_asof_group(group):
    return pd.merge_asof(
        group[0],
        group[1],
        left_on='Date',
        right_on='Publish Date',
        direction='backward'
    )

# Group by 'SimFinId' and apply the merge_asof function
df_merged = df_prices_sorted.groupby('SimFinId').apply(
    lambda x: merge_asof_group((x, df_income_sorted[df_income_sorted['SimFinId'] == x.name]))
).reset_index(drop=True)
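
For comparison, here is a minimal sketch of the same as-of join using the `by` parameter of `pd.merge_asof` instead of `groupby().apply()`. The toy frames and their values are made up for illustration; only the column names mirror the notebook. With `by`, both frames must be sorted by their date key (not by SimFinId), and the result carries a single SimFinId column.

```python
import pandas as pd

# Toy data for two hypothetical companies (ids 1 and 2)
prices = pd.DataFrame({
    "SimFinId": [1, 1, 1, 2, 2],
    "Date": pd.to_datetime(
        ["2020-01-10", "2020-03-10", "2020-06-10", "2020-01-10", "2020-06-10"]
    ),
    "Close": [10.0, 11.0, 12.0, 50.0, 55.0],
})
income = pd.DataFrame({
    "SimFinId": [1, 1, 2],
    "Publish Date": pd.to_datetime(["2020-02-01", "2020-05-01", "2020-03-01"]),
    "Revenue": [100.0, 120.0, 900.0],
})

# merge_asof needs both frames sorted by their date keys
prices = prices.sort_values("Date")
income = income.sort_values("Publish Date")

# For each price row, take the latest statement of the same company
# published on or before the price date
merged = pd.merge_asof(
    prices,
    income,
    left_on="Date",
    right_on="Publish Date",
    by="SimFinId",
    direction="backward",
).sort_values(["SimFinId", "Date"]).reset_index(drop=True)

print(merged[["SimFinId", "Date", "Close", "Revenue"]])
```

Price rows dated before a company's first Publish Date get NaN for all income columns, which is the same effect visible in the early rows of the merged output in this notebook.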



In [8]:
# Check if there are any mismatches in the new merged DataFrame
df_merged['SimFinId_match'] = df_merged['SimFinId_x'] == df_merged['SimFinId_y']
mismatch_rows_correct = df_merged[~df_merged['SimFinId_match']]
# Display the mismatch rows: SimFinId_y is NaN where no income statement
# was published on or before the price date
mismatch_rows_correct[['Ticker','Date','Close','Revenue','SimFinId_x','SimFinId_y', 'SimFinId_match']]

Unnamed: 0,Ticker,Date,Close,Revenue,SimFinId_x,SimFinId_y,SimFinId_match
0,GOOG,2019-04-26,63.61,,18,,False
1,GOOG,2019-04-29,64.38,,18,,False
2,GOOG,2019-04-30,59.42,,18,,False
3,GOOG,2019-05-01,58.40,,18,,False
4,GOOG,2019-05-02,58.13,,18,,False
...,...,...,...,...,...,...,...
5871341,NYXH,2024-03-22,14.72,,18589408,,False
5871342,NYXH,2024-03-25,13.56,,18589408,,False
5871343,NYXH,2024-03-26,13.37,,18589408,,False
5871344,NYXH,2024-03-27,13.26,,18589408,,False


In [9]:
df_merged.rename(columns={'SimFinId_x':'SimFinId'}, inplace=True)
df_merged.drop(columns=['SimFinId_y','SimFinId_match'], inplace=True)

In [10]:
df_merged[df_merged['SimFinId']==133209]

Unnamed: 0,Ticker,Date,SimFinId,Open,High,Low,Close,Adj. Close,Volume,Dividend,...,Non-Operating Income (Loss),"Interest Expense, Net","Pretax Income (Loss), Adj.",Abnormal Gains (Losses),Pretax Income (Loss),"Income Tax (Expense) Benefit, Net",Income (Loss) from Continuing Operations,Net Extraordinary Gains (Losses),Net Income,Net Income (Common)
403724,PG,2019-04-26,133209,103.95,105.88,103.91,105.86,91.79,7747682,,...,,,,,,,,,,
403725,PG,2019-04-29,133209,105.57,105.75,104.72,104.78,90.86,4787561,,...,,,,,,,,,,
403726,PG,2019-04-30,133209,104.86,106.62,104.66,106.48,92.33,8251321,,...,,,,,,,,,,
403727,PG,2019-05-01,133209,106.15,106.39,104.81,104.93,90.99,6734075,,...,,,,,,,,,,
403728,PG,2019-05-02,133209,105.20,105.76,105.05,105.56,91.53,6270323,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
404959,PG,2024-03-22,133209,162.20,162.41,161.47,161.66,157.77,6393425,,...,219000000.0,-449000000.0,1.835300e+10,,1.835300e+10,-3.615000e+09,1.473800e+10,,1.465300e+10,1.465300e+10
404960,PG,2024-03-25,133209,161.17,161.66,159.73,160.19,156.34,7145692,,...,219000000.0,-449000000.0,1.835300e+10,,1.835300e+10,-3.615000e+09,1.473800e+10,,1.465300e+10,1.465300e+10
404961,PG,2024-03-26,133209,160.36,161.14,160.14,160.55,156.69,5842850,,...,219000000.0,-449000000.0,1.835300e+10,,1.835300e+10,-3.615000e+09,1.473800e+10,,1.465300e+10,1.465300e+10
404962,PG,2024-03-27,133209,161.36,162.74,161.34,162.61,158.70,6599711,,...,219000000.0,-449000000.0,1.835300e+10,,1.835300e+10,-3.615000e+09,1.473800e+10,,1.465300e+10,1.465300e+10


In [11]:
df_merged.describe()

Unnamed: 0,Date,SimFinId,Open,High,Low,Close,Adj. Close,Volume,Dividend,Shares Outstanding,...,Non-Operating Income (Loss),"Interest Expense, Net","Pretax Income (Loss), Adj.",Abnormal Gains (Losses),Pretax Income (Loss),"Income Tax (Expense) Benefit, Net",Income (Loss) from Continuing Operations,Net Extraordinary Gains (Losses),Net Income,Net Income (Common)
count,5871346,5871346.0,5871346.0,5871346.0,5871346.0,5871346.0,5871346.0,5871346.0,35886.0,5327273.0,...,3566327.0,3165979.0,3619445.0,2334826.0,3619844.0,2910392.0,3620344.0,372262.0,3620344.0,3620344.0
mean,2021-11-10 20:02:07.225614080,5949627.0,16158.0,16256.47,16023.93,16115.22,16113.06,1783667.0,0.455062,598837600000.0,...,-138623000.0,-208563700.0,1037328000.0,-47636180.0,919536300.0,-277946900.0,785406200.0,93369070.0,782776300.0,780012700.0
min,2019-04-26 00:00:00,18.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-1007982000000.0,-1381290000000.0,-1574245000000.0,-31769000000.0,-1574245000000.0,-505456000000.0,-1394512000000.0,-4306000000.0,-1409383000000.0,-1409383000000.0
25%,2020-09-04 00:00:00,627774.0,7.8,8.01,7.57,7.8,7.32,35978.0,0.12,18106150.0,...,-47000000.0,-67200000.0,-29154000.0,-40840000.0,-40438000.0,-51605000.0,-40370040.0,-8467000.0,-39817000.0,-40744940.0
50%,2021-11-29 00:00:00,1326823.0,20.25,20.7,19.84,20.25,18.83,243588.0,0.25,49044990.0,...,-4399000.0,-9791000.0,7398000.0,-4197610.0,3267000.0,-4300000.0,2493000.0,37000.0,2540000.0,2050922.0
75%,2023-01-24 00:00:00,11035980.0,51.72,52.65,50.75,51.69,48.74,938577.8,0.45,131358000.0,...,132000.0,-236000.0,201968000.0,924000.0,179000000.0,-8000.0,147068000.0,25100000.0,142700000.0,141214000.0
max,2024-03-28 00:00:00,18589410.0,100000000.0,100000000.0,100000000.0,100000000.0,100000000.0,18489980000.0,1500.0,6667887000000000.0,...,425846000000.0,201704000000.0,2710835000000.0,280999000000.0,2710835000000.0,736601000000.0,2205379000000.0,21827000000.0,2201565000000.0,2201565000000.0
std,,6099718.0,1220804.0,1223978.0,1216359.0,1219076.0,1219076.0,28725760.0,7.994542,61718900000000.0,...,7746022000.0,7944650000.0,43240300000.0,3873273000.0,44450490000.0,8980231000.0,35356880000.0,705171700.0,35343300000.0,35343250000.0


In [12]:
df_merged.isna().mean() # ~99% of the Dividend column is null across all tickers

Ticker                                      0.000000
Date                                        0.000000
SimFinId                                    0.000000
Open                                        0.000000
High                                        0.000000
Low                                         0.000000
Close                                       0.000000
Adj. Close                                  0.000000
Volume                                      0.000000
Dividend                                    0.993888
Shares Outstanding                          0.092666
Currency                                    0.383388
Fiscal Year                                 0.383388
Fiscal Period                               0.383388
Publish Date                                0.383388
Restated Date                               0.383388
Shares (Basic)                              0.387846
Shares (Diluted)                            0.392716
Revenue                                     0.

# Data Cleaning #

✅ Convert 'Date' into a Datetime Object

✅ Remove duplicate rows if needed

✅ Handle missing values


✅ Drop Dividend column (99% nulls)

✅ Shares Outstanding - Replace nulls with the latest value reported per Ticker

✅ Shares Outstanding - Replace remaining nulls (for tickers that have never reported a Shares Outstanding value) with 0 as a placeholder

✅ Drop Columns with Too Many Missing Values (>70%)

✅ Fill Numerical Columns (e.g. 'Revenue', 'Cost of Revenue', 'Gross Profit', 'Operating Expenses') by forward/backward filling within each company

✅ Fill Categorical Columns with Mode ('Currency', 'Ticker', 'Fiscal Period')

✅ Drop Rows with Too Many Nulls
If some rows still have excessive missing values (>30%), drop them.

✅ Interpolation for Time-Series Data

❌ Outlier handling - not needed for cleaning.

Changing Date into a Datetime Object

In [13]:
df_merged['Date'] = pd.to_datetime(df_merged['Date']) # converting the "Date" column to a Datetime Object 

Checking for Duplicates


In [14]:
df_merged.duplicated().sum()  # Count duplicates

np.int64(0)

Dropping the "Dividend" column (99% nulls)

In [15]:
df_merged = df_merged.drop(columns=["Dividend"])

Handling the null values in "Shares Outstanding" 

In [16]:
# Sort the DataFrame by Ticker and Date in ascending order
df_merged = df_merged.sort_values(by=["Ticker", "Date"])

# Get the latest (most recent) non-null "Shares Outstanding" per Ticker
latest_shares_outstanding = df_merged.groupby("Ticker")["Shares Outstanding"].last()

# Fill missing values using the latest reported value per Ticker
df_merged["Shares Outstanding"] = df_merged["Shares Outstanding"].fillna(df_merged["Ticker"].map(latest_shares_outstanding))

In [17]:
df_merged.isna().mean()

Ticker                                      0.000000
Date                                        0.000000
SimFinId                                    0.000000
Open                                        0.000000
High                                        0.000000
Low                                         0.000000
Close                                       0.000000
Adj. Close                                  0.000000
Volume                                      0.000000
Shares Outstanding                          0.025208
Currency                                    0.383388
Fiscal Year                                 0.383388
Fiscal Period                               0.383388
Publish Date                                0.383388
Restated Date                               0.383388
Shares (Basic)                              0.387846
Shares (Diluted)                            0.392716
Revenue                                     0.442331
Cost of Revenue                             0.

The reason some “Shares Outstanding” values are still null, even after replacing them with the latest available value, is that .last() in groupby("Ticker") selects only the most recent non-null value per Ticker. However, if a Ticker never had a reported non-null value, .last() returns NaN instead of a valid number. As a result, when we try to map and fill missing values, there is no valid value available for replacement, leaving some entries still null.
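
The behavior described above can be reproduced on a toy frame. This is a minimal sketch with made-up tickers ("AAA", "BBB") and values: "AAA" has at least one reported value, "BBB" has none.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Ticker": ["AAA", "AAA", "AAA", "BBB", "BBB"],
    "Shares Outstanding": [100.0, np.nan, 120.0, np.nan, np.nan],
})

# groupby(...).last() skips NaN within each group:
# AAA -> 120.0, BBB -> NaN (no non-null value exists)
latest = df.groupby("Ticker")["Shares Outstanding"].last()

# Map each row's ticker to its latest value and use it only where the
# original value is missing
filled = df["Shares Outstanding"].fillna(df["Ticker"].map(latest))

print(latest)
print(filled.tolist())
```

The gap for "AAA" is filled with 120.0, while both "BBB" rows stay NaN because there is nothing to map from.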

In [18]:
null_so =  df_merged[df_merged["Shares Outstanding"].isna()]

null_so

Unnamed: 0,Ticker,Date,SimFinId,Open,High,Low,Close,Adj. Close,Volume,Shares Outstanding,...,Non-Operating Income (Loss),"Interest Expense, Net","Pretax Income (Loss), Adj.",Abnormal Gains (Losses),Pretax Income (Loss),"Income Tax (Expense) Benefit, Net",Income (Loss) from Continuing Operations,Net Extraordinary Gains (Losses),Net Income,Net Income (Common)
5585600,ACDC,2022-05-13,17623433,17.60,18.95,17.60,18.11,18.11,6325615,,...,,,,,,,,,,
5585601,ACDC,2022-05-16,17623433,18.00,18.59,17.16,17.99,17.99,1283643,,...,,,,,,,,,,
5585602,ACDC,2022-05-17,17623433,18.22,18.37,17.68,18.08,18.08,898019,,...,,,,,,,,,,
5585603,ACDC,2022-05-18,17623433,18.05,18.30,17.77,17.98,17.98,790809,,...,,,,,,,,,,
5585604,ACDC,2022-05-19,17623433,17.82,18.00,16.75,16.88,16.88,2122381,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5699235,ZWS,2024-03-22,17663788,32.91,33.05,32.53,32.57,32.25,1071076,,...,,,,,,,,,,
5699236,ZWS,2024-03-25,17663788,32.45,32.75,32.27,32.32,32.00,575686,,...,,,,,,,,,,
5699237,ZWS,2024-03-26,17663788,32.33,32.60,32.21,32.31,31.99,559464,,...,,,,,,,,,,
5699238,ZWS,2024-03-27,17663788,32.53,32.78,32.33,32.76,32.44,662613,,...,,,,,,,,,,


Replacing the Shares Outstanding values of tickers that never reported a non-null value with 0.

In [110]:
df_merged["Shares Outstanding"] = df_merged["Shares Outstanding"].fillna(0) # replacing the remaining "Shares Outstanding" with 0


In [111]:
null_means = df_merged.isna().mean()
for column, mean in null_means.items():
    if mean > 0:
        print(f"{column}: {mean:.6f}")

Currency: 0.383824
Fiscal Year: 0.383824
Fiscal Period: 0.383824
Publish Date: 0.383824
Restated Date: 0.383824
Shares (Basic): 0.388278
Shares (Diluted): 0.393145
Revenue: 0.442725
Cost of Revenue: 0.512134
Gross Profit: 0.512154
Operating Expenses: 0.385213
Selling, General & Administrative: 0.418883
Research & Development: 0.718494
Depreciation & Amortization: 0.746623
Operating Income (Loss): 0.383977
Non-Operating Income (Loss): 0.393017
Interest Expense, Net: 0.461156
Pretax Income (Loss), Adj.: 0.383977
Abnormal Gains (Losses): 0.602616
Pretax Income (Loss): 0.383909
Income Tax (Expense) Benefit, Net: 0.504657
Income (Loss) from Continuing Operations: 0.383824
Net Extraordinary Gains (Losses): 0.936642
Net Income: 0.383824
Net Income (Common): 0.383824


Drop Columns with Too Many Missing Values (>70%)
If a column has too many missing values and isn't critical, it's best to drop it.

Columns dropped: Research & Development, Depreciation & Amortization, Net Extraordinary Gains (Losses)

In [112]:
threshold = 0.7  # 70% missing values threshold
df_merged = df_merged.dropna(axis=1, thresh=len(df_merged) * (1 - threshold))
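
To make the `thresh` argument concrete: `dropna(axis=1, thresh=k)` keeps a column only if it has at least `k` non-null values, so a 70% missing-value threshold translates to "at least 30% non-null". The sketch below uses a made-up 10-row frame with hypothetical values and converts the threshold to an integer count first.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    # 1 of 10 values missing (10%)
    "Revenue": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
    # 8 of 10 values missing (80%)
    "R&D": [1.0, 2.0] + [np.nan] * 8,
})

threshold = 0.7                                   # maximum allowed share of nulls
min_non_null = int(len(df) * (1 - threshold))     # 3 non-null values required

kept = df.dropna(axis=1, thresh=min_non_null)
dropped = df.columns.difference(kept.columns)

print("kept:", list(kept.columns))
print("dropped:", list(dropped))
```

Printing the dropped column names this way is a cheap check that the threshold removed exactly the columns that were expected.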

Fill the remaining gaps in the income data per company. After sorting by SimFinId and Date, applying ffill and then bfill within each SimFinId group fills every missing value with the closest available value for that company: the most recent earlier value where one exists, otherwise the next later one. Companies with no income data at all remain null after this step.
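
A minimal sketch of the group-wise fill on made-up data (two hypothetical companies, one column). It uses `transform` on a single column rather than `apply` on the whole frame, which is an illustrative choice and not the notebook's exact call.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "SimFinId": [1, 1, 1, 2, 2, 2],
    "Date": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03"] * 2),
    "Revenue": [np.nan, 100.0, np.nan, np.nan, np.nan, np.nan],
})
df = df.sort_values(["SimFinId", "Date"])

# Fill within each company only: ffill carries the last known value forward,
# bfill then fills leading gaps from the next known value
filled = df.groupby("SimFinId")["Revenue"].transform(lambda s: s.ffill().bfill())

print(filled.tolist())
```

Company 1 ends up fully filled with 100.0, including the first row, which receives a value from a later date. Company 2 has no value to propagate and stays NaN.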

In [113]:
# Sort the DataFrame by SimFinId and Date
df_merged = df_merged.sort_values(by=['SimFinId', 'Date'])

# Forward fill and backward fill missing values within each SimFinId group
df_merged = df_merged.groupby('SimFinId').apply(lambda group: group.ffill().bfill())

# Reset index after groupby
df_merged = df_merged.reset_index(drop=True)

# Check the merged result for a specific SimFinId
df_merged[df_merged['SimFinId'] == 133209]



Unnamed: 0,Ticker,Date,SimFinId,Open,High,Low,Close,Adj. Close,Volume,Shares Outstanding,...,Non-Operating Income (Loss),"Interest Expense, Net","Pretax Income (Loss), Adj.",Abnormal Gains (Losses),Pretax Income (Loss),"Income Tax (Expense) Benefit, Net",Income (Loss) from Continuing Operations,Net Income,Net Income (Common),SimFinId_match
404073,PG,2019-04-25,133209,103.10,103.57,102.59,103.28,89.56,6085314,2.508330e+09,...,582000000.0,-289000000.0,1.441400e+10,-8.345000e+09,6.069000e+09,-2.103000e+09,3.966000e+09,3.897000e+09,3.897000e+09,False
404074,PG,2019-04-26,133209,103.95,105.88,103.91,105.86,91.79,7747682,2.508330e+09,...,582000000.0,-289000000.0,1.441400e+10,-8.345000e+09,6.069000e+09,-2.103000e+09,3.966000e+09,3.897000e+09,3.897000e+09,False
404075,PG,2019-04-29,133209,105.57,105.75,104.72,104.78,90.86,4787561,2.508330e+09,...,582000000.0,-289000000.0,1.441400e+10,-8.345000e+09,6.069000e+09,-2.103000e+09,3.966000e+09,3.897000e+09,3.897000e+09,False
404076,PG,2019-04-30,133209,104.86,106.62,104.66,106.48,92.33,8251321,2.508330e+09,...,582000000.0,-289000000.0,1.441400e+10,-8.345000e+09,6.069000e+09,-2.103000e+09,3.966000e+09,3.897000e+09,3.897000e+09,False
404077,PG,2019-05-01,133209,106.15,106.39,104.81,104.93,90.99,6734075,2.508330e+09,...,582000000.0,-289000000.0,1.441400e+10,-8.345000e+09,6.069000e+09,-2.103000e+09,3.966000e+09,3.897000e+09,3.897000e+09,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
405309,PG,2024-03-22,133209,162.20,162.41,161.47,161.66,157.77,6393425,2.353021e+09,...,219000000.0,-449000000.0,1.835300e+10,-8.345000e+09,1.835300e+10,-3.615000e+09,1.473800e+10,1.465300e+10,1.465300e+10,True
405310,PG,2024-03-25,133209,161.17,161.66,159.73,160.19,156.34,7145692,2.353021e+09,...,219000000.0,-449000000.0,1.835300e+10,-8.345000e+09,1.835300e+10,-3.615000e+09,1.473800e+10,1.465300e+10,1.465300e+10,True
405311,PG,2024-03-26,133209,160.36,161.14,160.14,160.55,156.69,5842850,2.353021e+09,...,219000000.0,-449000000.0,1.835300e+10,-8.345000e+09,1.835300e+10,-3.615000e+09,1.473800e+10,1.465300e+10,1.465300e+10,True
405312,PG,2024-03-27,133209,161.36,162.74,161.34,162.61,158.70,6599711,2.353021e+09,...,219000000.0,-449000000.0,1.835300e+10,-8.345000e+09,1.835300e+10,-3.615000e+09,1.473800e+10,1.465300e+10,1.465300e+10,True


In [114]:
df_merged.isna().mean()

Ticker                                      0.000000
Date                                        0.000000
SimFinId                                    0.000000
Open                                        0.000000
High                                        0.000000
Low                                         0.000000
Close                                       0.000000
Adj. Close                                  0.000000
Volume                                      0.000000
Shares Outstanding                          0.000000
Currency                                    0.233609
Fiscal Year                                 0.233609
Fiscal Period                               0.233609
Publish Date                                0.233609
Restated Date                               0.233609
Shares (Basic)                              0.240664
Shares (Diluted)                            0.243034
Revenue                                     0.285990
Cost of Revenue                             0.

Fill Categorical Columns with Mode
Applies to categorical columns such as Currency, Ticker, and Fiscal Period.

- These columns contain discrete values rather than numbers.
- The mode is the most frequently occurring category, making it a logical replacement for missing values.
- If we later use one-hot encoding or label encoding, missing values could create problems.
- Filling with the mode ensures that all rows have valid categorical values.
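
A minimal sketch of a per-company mode fill on made-up data. The helper name `fill_with_group_mode` is hypothetical, and unlike the notebook cell (which falls back to an empty string) this variant leaves a group untouched when it has no observed value at all.

```python
import pandas as pd

df = pd.DataFrame({
    "SimFinId": [1, 1, 1, 2, 2],
    "Currency": ["USD", None, "USD", None, None],
})

def fill_with_group_mode(s: pd.Series) -> pd.Series:
    """Fill missing entries with the most frequent value of the group."""
    modes = s.mode()        # mode() ignores missing values
    if modes.empty:         # the group has no observed value at all
        return s            # leave it missing rather than invent a category
    return s.fillna(modes.iloc[0])

df["Currency"] = df.groupby("SimFinId")["Currency"].transform(fill_with_group_mode)

print(df)
```

Company 1 gets "USD" in its missing row; company 2 stays missing, which keeps the gap visible for a later step instead of hiding it behind a placeholder.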

In [115]:
# Fill missing values for categorical columns with the per-company mode
# (falls back to '' when a company has no observed value).
# Fiscal Period is filled the same way: Publish Date is missing on the same rows,
# so its year cannot serve as a replacement
cat_cols = ['Currency', 'Ticker', 'Fiscal Period']
for col in cat_cols:
    df_merged[col] = df_merged.groupby('SimFinId')[col].transform(lambda x: x.fillna(x.mode()[0] if not x.mode().empty else ''))

# Check the merged result 
df_merged.isna().mean()


Ticker                                      0.000000
Date                                        0.000000
SimFinId                                    0.000000
Open                                        0.000000
High                                        0.000000
Low                                         0.000000
Close                                       0.000000
Adj. Close                                  0.000000
Volume                                      0.000000
Shares Outstanding                          0.000000
Currency                                    0.000000
Fiscal Year                                 0.233609
Fiscal Period                               0.000000
Publish Date                                0.233609
Restated Date                               0.233609
Shares (Basic)                              0.240664
Shares (Diluted)                            0.243034
Revenue                                     0.285990
Cost of Revenue                             0.

Drop Rows with Too Many Nulls
If some rows still have excessive missing values, drop them.

In [117]:
df_merged.dropna(thresh=len(df_merged.columns) * 0.7, inplace=True)  # Drops rows with >30% missing


Interpolation for Time-Series Data

Uses linear interpolation to estimate missing values between known values.
Works by computing a straight-line equation between two known points and filling missing values accordingly.
- If your dataset follows a time pattern (e.g., financial data), use interpolation.
- For time-series financial data (e.g., stock prices, revenue), missing values often occur due to holidays, reporting delays, or system gaps.
- Interpolation helps smooth data trends rather than using arbitrary imputation like mean/median.
- It ensures that missing values follow the trend instead of creating unnatural spikes.
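
As a worked example of the straight-line rule: a gap between 10 and 14 at equal spacing is filled with (10 + 14) / 2 = 12. The sketch below applies this on made-up data and interpolates within each company, which is an assumption about the intended behavior. The notebook cell sorts by Date only and interpolates over the whole frame, so neighboring rows there can belong to different companies.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "SimFinId": [1, 1, 1, 2, 2, 2],
    "Date": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03"] * 2),
    "Close": [10.0, np.nan, 14.0, 100.0, np.nan, 200.0],
})
df = df.sort_values(["SimFinId", "Date"])

# Linear interpolation per company: each gap is filled from that
# company's own neighboring values
df["Close"] = df.groupby("SimFinId")["Close"].transform(
    lambda s: s.interpolate(method="linear")
)

print(df["Close"].tolist())
```

Each company's gap is filled from its own series (12.0 and 150.0 here), so values from one company never leak into another.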

In [118]:
df_merged = df_merged.sort_values('Date')  # Ensure sorting before interpolation
df_merged.interpolate(method='linear', inplace=True)




In [None]:
# Calculate the percentage of missing values for each column
missing_values = df_merged.isna().mean()
# Filter columns with missing values greater than 0
missing_columns = missing_values[missing_values > 0]

missing_columns

Interest Expense, Net    2.220967e-07
dtype: float64

In [None]:
# Fill missing values with 0 for the columns with missing values
df_merged[missing_columns.index] = df_merged[missing_columns.index].fillna(0)

# Check that there are no more columns with missing values 
missing_columns_after_fill = df_merged.isna().mean()
missing_columns_after_fill[missing_columns_after_fill > 0]

Series([], dtype: float64)

In [127]:
os.makedirs("data/PREPROCESSING", exist_ok=True)

In [128]:
# Save merged dataset
df_merged.to_csv("data/PREPROCESSING/merged_stock_income.csv", index=False) 