#  3_Enrico_Pre-Processing_Draft.ipynb

This notebook logically follows the merge step and performs data cleaning & preprocessing on merged_stock_income.csv from 2_Enrico_Merged_Draft.ipynb.

✔ What it does:
- Loads the merged dataset from the previous step.

Cleans and prepares the data:
- Drops the "Dividend" column (mostly null).
- Converts "Date" to datetime.
- Handles nulls in "Shares Outstanding" intelligently using group-wise .last() and .map().
- Drops or fills missing values using thresholds, medians (for numeric), and modes (for categorical).
- Applies linear interpolation for remaining NaNs.
- Saves the cleaned version (in commented cell).

In [None]:
import pandas as pd

# Load merged dataset
df_merged = pd.read_csv("/Users/enricotajanlangit/Desktop/merged_stock_income.csv") # - ADAPT TO YOUR LOCAL ENVIRONMENT

df_merged.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18995999 entries, 0 to 18995998
Data columns (total 36 columns):
 #   Column                                    Dtype  
---  ------                                    -----  
 0   Ticker                                    object 
 1   Date                                      object 
 2   SimFinId                                  int64  
 3   Open                                      float64
 4   High                                      float64
 5   Low                                       float64
 6   Close                                     float64
 7   Adj. Close                                float64
 8   Volume                                    int64  
 9   Dividend                                  float64
 10  Shares Outstanding                        float64
 11  Currency                                  object 
 12  Fiscal Year                               int64  
 13  Fiscal Period                             object 
 14  

In [3]:
df_merged.head()

Unnamed: 0,Ticker,Date,SimFinId,Open,High,Low,Close,Adj. Close,Volume,Dividend,...,Non-Operating Income (Loss),"Interest Expense, Net","Pretax Income (Loss), Adj.",Abnormal Gains (Losses),Pretax Income (Loss),"Income Tax (Expense) Benefit, Net",Income (Loss) from Continuing Operations,Net Extraordinary Gains (Losses),Net Income,Net Income (Common)
0,A,2019-04-15,45846,81.0,81.13,79.91,80.4,77.22,1627268,,...,-22000000.0,-38000000.0,919000000.0,,919000000.0,152000000.0,1071000000,,1071000000,1071000000
1,A,2019-04-15,45846,81.0,81.13,79.91,80.4,77.22,1627268,,...,-4000000.0,-70000000.0,842000000.0,,842000000.0,-123000000.0,719000000,,719000000,719000000
2,A,2019-04-15,45846,81.0,81.13,79.91,80.4,77.22,1627268,,...,13000000.0,-79000000.0,1360000000.0,,1360000000.0,-150000000.0,1210000000,,1210000000,1210000000
3,A,2019-04-15,45846,81.0,81.13,79.91,80.4,77.22,1627268,,...,-114000000.0,-75000000.0,1504000000.0,,1504000000.0,-250000000.0,1254000000,,1254000000,1254000000
4,A,2019-04-15,45846,81.0,81.13,79.91,80.4,77.22,1627268,,...,-11000000.0,-44000000.0,1339000000.0,,1339000000.0,-99000000.0,1240000000,,1240000000,1240000000


In [5]:
df_merged.columns

Index(['Ticker', 'Date', 'SimFinId', 'Open', 'High', 'Low', 'Close',
       'Adj. Close', 'Volume', 'Dividend', 'Shares Outstanding', 'Currency',
       'Fiscal Year', 'Fiscal Period', 'Publish Date', 'Restated Date',
       'Shares (Basic)', 'Shares (Diluted)', 'Revenue', 'Cost of Revenue',
       'Gross Profit', 'Operating Expenses',
       'Selling, General & Administrative', 'Research & Development',
       'Depreciation & Amortization', 'Operating Income (Loss)',
       'Non-Operating Income (Loss)', 'Interest Expense, Net',
       'Pretax Income (Loss), Adj.', 'Abnormal Gains (Losses)',
       'Pretax Income (Loss)', 'Income Tax (Expense) Benefit, Net',
       'Income (Loss) from Continuing Operations',
       'Net Extraordinary Gains (Losses)', 'Net Income',
       'Net Income (Common)'],
      dtype='object')

In [6]:
# example DF for CRWD ticker only 
df_crwd = df_merged[df_merged["Ticker"] == "CRWD"]

# Display the filtered DataFrame
df_crwd.head()

Unnamed: 0,Ticker,Date,SimFinId,Open,High,Low,Close,Adj. Close,Volume,Dividend,...,Non-Operating Income (Loss),"Interest Expense, Net","Pretax Income (Loss), Adj.",Abnormal Gains (Losses),Pretax Income (Loss),"Income Tax (Expense) Benefit, Net",Income (Loss) from Continuing Operations,Net Extraordinary Gains (Losses),Net Income,Net Income (Common)
4297154,CRWD,2019-06-12,1039026,63.5,67.0,56.0,58.0,58.0,19449162,,...,6283000.0,-442000.0,-139782000.0,,-139782000.0,-1997000.0,-141779000,,-141779000,-141779000
4297155,CRWD,2019-06-12,1039026,63.5,67.0,56.0,58.0,58.0,19449162,,...,4660000.0,3409000.0,-87869000.0,,-87869000.0,-4760000.0,-92629000,,-92629000,-92629000
4297156,CRWD,2019-06-12,1039026,63.5,67.0,56.0,58.0,58.0,19449162,,...,-17475000.0,-21443000.0,-160023000.0,,-160023000.0,-72355000.0,-232378000,,-234802000,-234802000
4297157,CRWD,2019-06-12,1039026,63.5,67.0,56.0,58.0,58.0,19449162,,...,30229000.0,27176000.0,-159883000.0,,-159883000.0,-22402000.0,-182285000,,-183245000,-183245000
4297158,CRWD,2019-06-12,1039026,63.5,67.0,56.0,58.0,58.0,19449162,,...,124812000.0,123174000.0,122817000.0,,122817000.0,-32232000.0,90585000,,89327000,89327000


In [7]:
df_merged.describe()

Unnamed: 0,SimFinId,Open,High,Low,Close,Adj. Close,Volume,Dividend,Shares Outstanding,Fiscal Year,...,Non-Operating Income (Loss),"Interest Expense, Net","Pretax Income (Loss), Adj.",Abnormal Gains (Losses),Pretax Income (Loss),"Income Tax (Expense) Benefit, Net",Income (Loss) from Continuing Operations,Net Extraordinary Gains (Losses),Net Income,Net Income (Common)
count,18996000.0,18996000.0,18996000.0,18996000.0,18996000.0,18996000.0,18996000.0,106554.0,18314020.0,18996000.0,...,18774500.0,16672100.0,18993130.0,12269190.0,18993520.0,15529530.0,18996000.0,1872766.0,18996000.0,18996000.0
mean,4209057.0,748.4808,772.989,715.6847,740.8929,738.9101,1972607.0,0.451973,278580300.0,2020.939,...,-138480100.0,-221851900.0,538993900.0,-26930530.0,440517600.0,-120421600.0,425954200.0,113857500.0,424337500.0,420869000.0
std,5109715.0,53727.95,55523.22,50854.31,52910.14,52910.16,30433400.0,1.252695,4835946000.0,1.381633,...,9954461000.0,12607880000.0,25548470000.0,4453758000.0,27368590000.0,8085709000.0,19794430000.0,919918500.0,19873300000.0,19872850000.0
min,18.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2019.0,...,-1007982000000.0,-1381290000000.0,-1574245000000.0,-31769000000.0,-1574245000000.0,-505456000000.0,-1394512000000.0,-4306000000.0,-1409383000000.0,-1409383000000.0
25%,422040.0,8.95,9.2,8.68,8.94,8.48,75112.5,0.13,21586000.0,2020.0,...,-51971000.0,-73844000.0,-27946000.0,-42806000.0,-38602000.0,-58795000.0,-38627000.0,-8467000.0,-38121000.0,-39354500.0
50%,995248.0,24.3,24.84,23.74,24.27,23.0,342601.0,0.26,53296500.0,2021.0,...,-4818000.0,-10826000.0,11395000.0,-4664000.0,6014000.0,-5553000.0,4758000.0,2000.0,4771000.0,4055000.0
75%,9075984.0,63.44,64.57,62.27,63.4,60.33,1145300.0,0.5,141086200.0,2022.0,...,225000.0,-243929.0,237991000.0,684000.0,212952000.0,-22475.0,175631000.0,24000000.0,172365000.0,170600000.0
max,18482570.0,16320000.0,16320000.0,15424000.0,15680000.0,15680000.0,18489980000.0,68.06,475266600000.0,2023.0,...,425846000000.0,201704000000.0,2710835000000.0,280999000000.0,2710835000000.0,736601000000.0,2205379000000.0,21827000000.0,2201565000000.0,2201565000000.0


In [8]:
df_merged.isna().mean() # 99% of the dividend column is null for all tickers 

Ticker                                      0.000000
Date                                        0.000000
SimFinId                                    0.000000
Open                                        0.000000
High                                        0.000000
Low                                         0.000000
Close                                       0.000000
Adj. Close                                  0.000000
Volume                                      0.000000
Dividend                                    0.994391
Shares Outstanding                          0.035901
Currency                                    0.000000
Fiscal Year                                 0.000000
Fiscal Period                               0.000000
Publish Date                                0.000000
Restated Date                               0.000000
Shares (Basic)                              0.005773
Shares (Diluted)                            0.011968
Revenue                                     0.

# Data Cleaning #

✅ Converted 'Date' into a Datetime Object

✅ Remove duplicate rows if needed

✅ Handled missing values


✅ Drop dividend column (with 99% of nulls)

✅ Shares outstanding - Replace nulls with the latest value reported per ticker

✅ Shares outstanding - Replace remaining nulls (for tickers that have never reported a shares outstanding value) with 0, so the column contains no nulls

✅ Drop Columns with Too Many Missing Values (>50%)

✅ Fill Numerical Columns with Median imputation ('Revenue', 'Cost of Revenue', 'Gross Profit', 'Operating Expenses')

✅ Fill Categorical Columns with Mode ('Currency', 'Ticker', 'Fiscal Period')

✅ Drop Rows with Too Many Nulls
If some rows still have excessive missing values, drop them.

✅ Interpolation for Time-Series Data

❌ Outliers - not needed for cleaning. 

Changing Date into a Datetime Object

In [9]:
df_merged['Date'] = pd.to_datetime(df_merged['Date']) # converting the "Date" column to a Datetime Object 
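
A small, hedged sketch of how the conversion could be checked for unparseable values. The toy series below is made up; `errors="coerce"` is a standard `pd.to_datetime` option that turns values it cannot parse into `NaT` instead of raising. The notebook's own cell does not use it.

```python
import pandas as pd

# Toy input (illustrative values only): two valid dates and one bad entry
dates = pd.Series(["2019-04-15", "2019-04-16", "not a date"])

# errors="coerce" converts unparseable strings to NaT instead of raising
parsed = pd.to_datetime(dates, errors="coerce")

# Count how many entries failed to parse
n_bad = int(parsed.isna().sum())
print(parsed.dtype, n_bad)
```

Counting the `NaT` values after conversion gives a quick signal of whether any date strings were malformed.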

Checking for Duplicates

In [10]:
df_merged.duplicated().sum()  # Count duplicates


0
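
A result of 0 only says that no two rows are identical in every column. The `head()` output earlier shows the same `Ticker` and `Date` repeated with different income-statement values, so a key-level check may also be worth running. The sketch below uses a made-up toy frame; treating (`Ticker`, `Date`) as the key is an assumption, not something the notebook states.

```python
import pandas as pd

# Toy frame mimicking the merged layout: one price row repeated for
# several income statements of the same ticker (values are illustrative)
toy = pd.DataFrame({
    "Ticker": ["A", "A", "A", "B"],
    "Date": pd.to_datetime(["2019-04-15", "2019-04-15", "2019-04-16", "2019-04-15"]),
    "Close": [80.4, 80.4, 81.0, 10.0],
    "Net Income": [1071, 719, 1071, 5],
})

# Full-row check: counts rows identical to an earlier row in every column
full_row_dupes = int(toy.duplicated().sum())

# Key-level check: counts all rows that share a (Ticker, Date) pair
key_dupes = int(toy.duplicated(subset=["Ticker", "Date"], keep=False).sum())

print(full_row_dupes, key_dupes)
```

In the toy data the full-row count is 0 while two rows share a key, which is the same pattern visible in the merged dataset's first rows.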

Dropping the "Dividend" column with 99% of nulls

In [12]:
df_merged = df_merged.drop(columns=["Dividend"])

Handling the null values in "Shares Outstanding" 

In [15]:
# Sort the DataFrame by Ticker and Date in ascending order
df_merged = df_merged.sort_values(by=["Ticker", "Date"])

# Get the latest (most recent) non-null "Shares Outstanding" per Ticker
latest_shares_outstanding = df_merged.groupby("Ticker")["Shares Outstanding"].last()

# Fill missing values using the latest reported value per Ticker
df_merged["Shares Outstanding"] = df_merged["Shares Outstanding"].fillna(df_merged["Ticker"].map(latest_shares_outstanding))

In [16]:
df_merged.isna().mean()

Ticker                                      0.000000
Date                                        0.000000
SimFinId                                    0.000000
Open                                        0.000000
High                                        0.000000
Low                                         0.000000
Close                                       0.000000
Adj. Close                                  0.000000
Volume                                      0.000000
Shares Outstanding                          0.002579
Currency                                    0.000000
Fiscal Year                                 0.000000
Fiscal Period                               0.000000
Publish Date                                0.000000
Restated Date                               0.000000
Shares (Basic)                              0.005773
Shares (Diluted)                            0.011968
Revenue                                     0.083639
Cost of Revenue                             0.

The reason some “Shares Outstanding” values are still null, even after replacing them with the latest available value, is that .last() in groupby("Ticker") selects only the most recent non-null value per Ticker. However, if a Ticker never had a reported non-null value, .last() returns NaN instead of a valid number. As a result, when we try to map and fill missing values, there is no valid value available for replacement, leaving some entries still null.
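
The behaviour described above can be reproduced on a toy frame. The tickers and values are invented for illustration; the calls are the same `groupby(...).last()`, `.map()` and `.fillna()` used in the cell above.

```python
import numpy as np
import pandas as pd

# AAA has reported values; BBB never reported a non-null value
toy = pd.DataFrame({
    "Ticker": ["AAA", "AAA", "AAA", "BBB", "BBB"],
    "Shares Outstanding": [100.0, np.nan, 120.0, np.nan, np.nan],
})

# .last() skips NaN within each group, so AAA -> 120.0 and BBB -> NaN
latest = toy.groupby("Ticker")["Shares Outstanding"].last()

# Map each row's ticker to its latest value and fill the gaps with it
filled = toy["Shares Outstanding"].fillna(toy["Ticker"].map(latest))

print(latest)
print(filled)
```

Note that the gap in AAA is filled with 120.0 (the latest value), not the preceding 100.0, and that both BBB rows stay null because there is nothing to map from.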

In [17]:
null_so =  df_merged[df_merged["Shares Outstanding"].isna()]

null_so

Unnamed: 0,Ticker,Date,SimFinId,Open,High,Low,Close,Adj. Close,Volume,Shares Outstanding,...,Non-Operating Income (Loss),"Interest Expense, Net","Pretax Income (Loss), Adj.",Abnormal Gains (Losses),Pretax Income (Loss),"Income Tax (Expense) Benefit, Net",Income (Loss) from Continuing Operations,Net Extraordinary Gains (Losses),Net Income,Net Income (Common)
396145,ADRA,2019-04-15,11817231,32.16,32.16,32.16,32.16,31.43,86,,...,4307581.0,10281.0,3330750.0,-86544.0,3244206.0,,3244206,,3244206,3244206
396146,ADRA,2019-04-15,11817231,32.16,32.16,32.16,32.16,31.43,86,,...,-4056000.0,-4056000.0,47697000.0,-9656000.0,38041000.0,-9423000.0,28618000,,28618000,28618000
396147,ADRA,2019-04-15,11817231,32.16,32.16,32.16,32.16,31.43,86,,...,-11715000.0,-11715000.0,-36308000.0,-8154000.0,-44462000.0,9058000.0,-35404000,,-35404000,-35404000
396148,ADRA,2019-04-16,11817231,32.22,32.22,32.22,32.22,31.48,237,,...,4307581.0,10281.0,3330750.0,-86544.0,3244206.0,,3244206,,3244206,3244206
396149,ADRA,2019-04-16,11817231,32.22,32.22,32.22,32.22,31.48,237,,...,-4056000.0,-4056000.0,47697000.0,-9656000.0,38041000.0,-9423000.0,28618000,,28618000,28618000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18108553,VTOL,2024-03-14,10656416,25.83,26.00,25.53,25.76,25.76,75947,,...,-4593000.0,-41360000.0,28154000.0,-32573000.0,-4419000.0,-11294000.0,-15713000,,-15791000,-15791000
18108554,VTOL,2024-03-14,10656416,25.83,26.00,25.53,25.76,25.76,75947,,...,-35574000.0,-32771000.0,19101000.0,-1089000.0,18012000.0,-24932000.0,-6920000,,-6780000,-6780000
18108555,VTOL,2024-03-15,10656416,25.70,26.28,25.65,25.75,25.75,280498,,...,-22045000.0,-49966000.0,42707000.0,-99347000.0,-56640000.0,355000.0,-56285000,,-56094000,-56094000
18108556,VTOL,2024-03-15,10656416,25.70,26.28,25.65,25.75,25.75,280498,,...,-4593000.0,-41360000.0,28154000.0,-32573000.0,-4419000.0,-11294000.0,-15713000,,-15791000,-15791000


Replacing "Shares Outstanding" with 0 for tickers that have never reported a non-null value.

In [18]:
df_merged["Shares Outstanding"] = df_merged["Shares Outstanding"].fillna(0) # replacing the remaining "Shares Outstanding" with 0

In [21]:
df_merged.columns

Index(['Ticker', 'Date', 'SimFinId', 'Open', 'High', 'Low', 'Close',
       'Adj. Close', 'Volume', 'Shares Outstanding', 'Currency', 'Fiscal Year',
       'Fiscal Period', 'Publish Date', 'Restated Date', 'Shares (Basic)',
       'Shares (Diluted)', 'Revenue', 'Cost of Revenue', 'Gross Profit',
       'Operating Expenses', 'Selling, General & Administrative',
       'Research & Development', 'Depreciation & Amortization',
       'Operating Income (Loss)', 'Non-Operating Income (Loss)',
       'Interest Expense, Net', 'Pretax Income (Loss), Adj.',
       'Abnormal Gains (Losses)', 'Pretax Income (Loss)',
       'Income Tax (Expense) Benefit, Net',
       'Income (Loss) from Continuing Operations',
       'Net Extraordinary Gains (Losses)', 'Net Income',
       'Net Income (Common)'],
      dtype='object')

In [20]:
null_means = df_merged.isna().mean()
for column, mean in null_means.items():
    if mean > 0:
        print(f"{column}: {mean:.6f}")


Shares (Basic): 0.005773
Shares (Diluted): 0.011968
Revenue: 0.083639
Cost of Revenue: 0.195632
Gross Profit: 0.195577
Operating Expenses: 0.001986
Selling, General & Administrative: 0.054266
Research & Development: 0.539089
Depreciation & Amortization: 0.584403
Operating Income (Loss): 0.000151
Non-Operating Income (Loss): 0.011660
Interest Expense, Net: 0.122336
Pretax Income (Loss), Adj.: 0.000151
Abnormal Gains (Losses): 0.354117
Pretax Income (Loss): 0.000130
Income Tax (Expense) Benefit, Net: 0.182484
Net Extraordinary Gains (Losses): 0.901413


Drop Columns with Too Many Missing Values (>50%)
If a column has too many missing values and isn't critical, it's best to drop it.

Columns above the 50% threshold in the list above: Research & Development, Depreciation & Amortization, Net Extraordinary Gains (Losses)

In [22]:
threshold = 0.5  # 50% missing values threshold
df_merged = df_merged.dropna(axis=1, thresh=len(df_merged) * (1 - threshold))
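
To see which columns a given threshold removes before actually dropping them, the null shares can be filtered first. This is a sketch on a made-up two-column frame; `to_drop` and `kept` are illustrative names, and `thresh` is the minimum number of non-null values a column needs to survive.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Revenue": [1.0, 2.0, np.nan, 4.0],     # 25% missing
    "R&D": [np.nan, np.nan, np.nan, 1.0],   # 75% missing
})

threshold = 0.5  # same 50% threshold as in the cell above

# List the columns whose null share exceeds the threshold
null_share = toy.isna().mean()
to_drop = null_share[null_share > threshold].index.tolist()

# Keep only columns with at least (1 - threshold) * n_rows non-null values
kept = toy.dropna(axis=1, thresh=int(len(toy) * (1 - threshold)))

print(to_drop, kept.columns.tolist())
```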

Fill Numerical Columns with Mean/Median
For numerical data like Revenue, Gross Profit, etc., use median imputation.
The median is used instead of the mean because it is more robust to outliers.
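
A quick illustration of that robustness, using invented numbers: a single extreme value pulls the mean far away from the typical observation, while the median barely moves. The `describe()` output earlier shows the same effect in the real data (for example, the mean of `Open` is far above its 50% quantile).

```python
import pandas as pd

# Four typical values and one extreme outlier (illustrative only)
revenue = pd.Series([10.0, 12.0, 11.0, 13.0, 10_000.0])

mean_value = revenue.mean()      # dominated by the outlier
median_value = revenue.median()  # stays at a typical value

print(mean_value, median_value)
```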

In [24]:
num_cols = ['Revenue', 'Cost of Revenue', 'Gross Profit', 'Operating Expenses']
for col in num_cols:
    # Assign the result back instead of calling fillna(..., inplace=True) on a
    # column selection, which pandas warns will stop working in pandas 3.0
    df_merged[col] = df_merged[col].fillna(df_merged[col].median())


Fill Categorical Columns with Mode
For categorical columns like Currency, Ticker, or Fiscal Period.

- Categorical columns like Currency, Ticker, and Fiscal Period contain discrete values.
- The mode is the most frequently occurring category, making it a logical replacement for missing values.
- If we later use one-hot encoding or label encoding, missing values could create problems.
- Filling with the mode ensures that all rows have valid categorical values.
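
A minimal sketch of mode imputation on a toy series (values invented). In this dataset the null-share listing above reports 0.000000 for Ticker, Currency and Fiscal Period, so the step is expected to leave those columns unchanged.

```python
import pandas as pd

# Toy categorical column with one missing entry
currency = pd.Series(["USD", "USD", None, "EUR"])

# mode() returns a Series of the most frequent value(s); take the first
mode_value = currency.mode()[0]

# Replace missing entries with the most frequent category
filled = currency.fillna(mode_value)

print(mode_value, filled.tolist())
```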

In [25]:
cat_cols = ['Currency', 'Ticker', 'Fiscal Period']
for col in cat_cols:
    # Assign the result back instead of calling fillna(..., inplace=True) on a
    # column selection, which pandas warns will stop working in pandas 3.0
    df_merged[col] = df_merged[col].fillna(df_merged[col].mode()[0])


Drop Rows with Too Many Nulls
If some rows still have excessive missing values, drop them.

In [26]:
df_merged.dropna(thresh=len(df_merged.columns) * 0.7, inplace=True)  # Drops rows with >30% missing
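
The `thresh` argument is the minimum number of non-null values a row must have to be kept. The toy frame below (invented values) shows the effect with four columns. Rounding the threshold up with `math.ceil` is an illustrative choice that gives the same cut-off as passing the fractional value directly.

```python
import math

import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [2.0, 2.0, np.nan],
    "c": [3.0, 3.0, np.nan],
    "d": [4.0, 4.0, 4.0],
})

# A row needs at least 70% of the columns filled: ceil(4 * 0.7) = 3
min_non_null = math.ceil(len(toy.columns) * 0.7)

# Row 0 has 4 values, row 1 has 3, row 2 has only 1 and is dropped
kept = toy.dropna(thresh=min_non_null)

print(min_non_null, kept.index.tolist())
```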


Interpolation for Time-Series Data

Uses linear interpolation to estimate missing values between known values.
Works by computing a straight-line equation between two known points and filling missing values accordingly.
- If your dataset follows a time pattern (e.g., financial data), use interpolation.
- For time-series financial data (e.g., stock prices, revenue), missing values often occur due to holidays, reporting delays, or system gaps.
- Interpolation helps smooth data trends rather than using arbitrary imputation like mean/median.
- It ensures that missing values follow the trend instead of creating unnatural spikes.
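
One caveat worth illustrating: when rows are sorted by Date alone, rows from different tickers sit next to each other, so a straight line can end up being drawn between unrelated companies' values. The sketch below, on invented data, contrasts that with interpolating within each ticker. The per-ticker variant is an assumption about what may be preferable, not what the notebook's cell does.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Ticker": ["AAA", "BBB", "AAA", "BBB", "AAA", "BBB"],
    "Date": pd.to_datetime([
        "2024-01-01", "2024-01-01", "2024-01-02",
        "2024-01-02", "2024-01-03", "2024-01-03",
    ]),
    "Revenue": [10.0, 1000.0, np.nan, np.nan, 30.0, 3000.0],
})

# Global interpolation over date-sorted rows mixes AAA and BBB values
global_fill = (
    toy.sort_values(["Date", "Ticker"])["Revenue"].interpolate(method="linear")
)

# Per-ticker interpolation: sort within each ticker, then fill inside the group
toy = toy.sort_values(["Ticker", "Date"])
per_ticker = toy.groupby("Ticker")["Revenue"].transform(
    lambda s: s.interpolate(method="linear")
)

print(global_fill.loc[2], per_ticker.loc[2], per_ticker.loc[3])
```

In the toy data, the missing AAA value lands between 1000 and 30 under global interpolation, while the per-ticker version yields 20 for AAA and 2000 for BBB.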

In [28]:
df_merged = df_merged.sort_values('Date')  # Ensure sorting before interpolation
df_merged.interpolate(method='linear', inplace=True)




In [None]:
df_merged.isna().mean()

Ticker                                      0.000000e+00
Date                                        0.000000e+00
SimFinId                                    0.000000e+00
Open                                        0.000000e+00
High                                        0.000000e+00
Low                                         0.000000e+00
Close                                       0.000000e+00
Adj. Close                                  0.000000e+00
Volume                                      0.000000e+00
Shares Outstanding                          0.000000e+00
Currency                                    0.000000e+00
Fiscal Year                                 0.000000e+00
Fiscal Period                               0.000000e+00
Publish Date                                0.000000e+00
Restated Date                               0.000000e+00
Shares (Basic)                              0.000000e+00
Shares (Diluted)                            0.000000e+00
Revenue                        

In [None]:
# Define the file path
#file_path = "/Users/enricotajanlangit/Desktop/pre-processed.csv"

# Save the DataFrame as a CSV file
#df_merged.to_csv(file_path, index=False)

#print(f"File saved successfully at: {file_path}")

OSError: [Errno 28] No space left on device
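
The `OSError` above indicates the disk filled up while writing the CSV. One possible mitigation, sketched here on a small made-up frame and a temporary directory, is to write a gzip-compressed file via the `compression` argument of `to_csv`; `pd.read_csv` can read it back directly. The file names are placeholders, and whether compression saves enough space for the full dataset is not something this sketch verifies.

```python
import os
import tempfile

import pandas as pd

# Small, repetitive toy frame standing in for the cleaned dataset
toy = pd.DataFrame({"Ticker": ["A"] * 1000, "Close": [80.4] * 1000})

with tempfile.TemporaryDirectory() as tmp_dir:
    plain_path = os.path.join(tmp_dir, "pre-processed.csv")
    gz_path = os.path.join(tmp_dir, "pre-processed.csv.gz")

    # Write the same data uncompressed and gzip-compressed
    toy.to_csv(plain_path, index=False)
    toy.to_csv(gz_path, index=False, compression="gzip")

    plain_size = os.path.getsize(plain_path)
    gz_size = os.path.getsize(gz_path)

    # read_csv infers gzip compression from the .gz extension
    restored = pd.read_csv(gz_path)

print(plain_size, gz_size, restored.shape)
```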