# 04 Credit Sentiment (HYS) — Cleaning & Export

Goal: Clean the yearly credit sentiment (HYS) data, create an integer `date_key` (YYYYMMDD, Jan-01 of each year), and export `investor_credit_sentiment_cleaned.csv`.

Guardrails: work on copies, no inplace, explicit conversions, validate early, conservative cleaning, document decisions.


What: Import libraries and constants.
Why: Keep environment consistent and configuration explicit.


In [1]:
import os
import pandas as pd
import numpy as np

pd.set_option('display.max_columns', 100)
pd.set_option('display.width', 160)

ANALYSIS_START_YEAR = 2000
ANALYSIS_END_YEAR = 2024

ROOT = os.path.abspath(os.path.join(os.getcwd(), "..")) if os.path.basename(os.getcwd()) == "notebooks" else os.getcwd()
RAW = os.path.join(ROOT, "original_data")
CLEAN = os.path.join(ROOT, "data", "cleaned")
os.makedirs(CLEAN, exist_ok=True)


What: Load Excel into a working copy and inspect.
Why: Confirm schema and spot data quality issues early.


In [2]:
# Load
raw_path = os.path.join(RAW, "InvestorCreditSentiment.xlsx")
df_raw = pd.read_excel(raw_path)
df = df_raw.copy()
print('Shape:', df.shape)
print(df.head())
print(df.info())


Shape: (90, 3)
   year       HYS  Notes
0  1926  0.182164    NaN
1  1927  0.176656    NaN
2  1928  0.269542    NaN
3  1929  0.261887    NaN
4  1930  0.134916    NaN
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 90 entries, 0 to 89
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   year    90 non-null     int64  
 1   HYS     90 non-null     float64
 2   Notes   0 non-null      float64
dtypes: float64(2), int64(1)
memory usage: 2.2 KB
None
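
Because the guardrails call for validating early, one option at this point is to check the year column for duplicates and gaps before filtering. The sketch below is illustrative only: `check_year_column` is a hypothetical helper, and the toy frame stands in for the loaded sheet rather than reproducing its values.

```python
import pandas as pd

def check_year_column(frame: pd.DataFrame, year_col: str = "year") -> dict:
    """Summarise duplicated and missing years in an annual key column (hypothetical helper)."""
    years = pd.to_numeric(frame[year_col], errors="coerce").dropna().astype(int)
    year_list = years.tolist()
    # Every year between the first and last observed year is expected to be present.
    expected = set(range(min(year_list), max(year_list) + 1))
    return {
        "duplicates": sorted(set(years[years.duplicated()].tolist())),
        "missing": sorted(expected - set(year_list)),
    }

# Toy data (assumed values): 1928 is absent and 1929 appears twice.
demo = pd.DataFrame({"year": [1926, 1927, 1929, 1929], "HYS": [0.18, 0.17, 0.26, 0.26]})
report = check_year_column(demo)
print(report)
```

An empty list under both keys would indicate one row per year with no gaps, which is the grain the later `date_key` step relies on.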


What: Detect the year column, restrict to the analysis window, create `date_key`, coerce numerics, tidy column names, and export the fact CSV.
Why: Standardize grain and types for reliable joins.


In [3]:
# Detect the year column (any column whose name contains "year")
possible_year_cols = [c for c in df.columns if 'year' in str(c).lower()]
if not possible_year_cols:
    raise ValueError('No year column detected; set YEAR_COL manually.')
YEAR_COL = possible_year_cols[0]

# Coerce years to nullable integers; unparseable values become <NA>
y = pd.to_numeric(df[YEAR_COL], errors='coerce').astype('Int64')
# Treat missing years as out of scope so the mask is strictly True/False
mask = ((y >= ANALYSIS_START_YEAR) & (y <= ANALYSIS_END_YEAR)).fillna(False)

fact = df.copy()
fact[YEAR_COL] = y
fact = fact.loc[mask].copy()

# Create date_key (Jan-01 of each year)
fact['date_key'] = pd.to_datetime(fact[YEAR_COL].astype(str) + '-01-01', errors='coerce').dt.strftime('%Y%m%d').astype(int)

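# Coerce every remaining column to numeric. This includes Notes, which is
# entirely null in this file; any free-text notes would become NaN here.
for c in fact.columns:
    if c not in (YEAR_COL, 'date_key'):
        fact[c] = pd.to_numeric(fact[c], errors='coerce')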

# Tidy columns
fact.columns = [str(c).strip().lower().replace(' ', '_') for c in fact.columns]

# Export the fact table (Power BI ready)
out_path = os.path.join(CLEAN, 'investor_credit_sentiment_cleaned.csv')
fact.to_csv(out_path, index=False, encoding='utf-8')
print('Wrote:', out_path)


Wrote: G:\ACADEMIA\VA 5122\Final Project\phase1_cleaning_preprocessing\data\cleaned\investor_credit_sentiment_cleaned.csv
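
The `date_key` built in the cell above goes through a string and datetime round trip. As a cross-check, the same YYYYMMDD integer can be derived arithmetically as `year * 10000 + 101`. The snippet below is a minimal sketch on made-up years, not a replacement for the notebook's approach; the variable names are assumptions.

```python
import pandas as pd

# Illustrative years only; the notebook derives these from the source sheet.
years = pd.Series([2000, 2001, 2015], dtype="Int64")

# String route, mirroring the cell above.
key_from_dates = (
    pd.to_datetime(years.astype(str) + "-01-01", errors="coerce")
    .dt.strftime("%Y%m%d")
    .astype(int)
)

# Arithmetic route: YYYY * 10000 + 0101 gives the same Jan-01 key.
key_from_math = (years * 10000 + 101).astype("int64")

# Both routes should agree for any valid four-digit year.
assert key_from_dates.tolist() == key_from_math.tolist()
print(key_from_math.tolist())  # [20000101, 20010101, 20150101]
```

The string route has the advantage of rejecting impossible dates through `errors='coerce'`, while the arithmetic route avoids parsing altogether; comparing the two is a cheap sanity check.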


What: Validate the exported fact table: shape, year range, `date_key` range, and nulls.
Why: Confirm the CSV written in the previous cell is Power BI ready, with clean keys for modeling.


In [4]:
# Validation
print('Shape:', fact.shape)
print('Year min/max:', fact['year'].min() if 'year' in fact.columns else 'n/a', fact['year'].max() if 'year' in fact.columns else 'n/a')
print('Min/Max date_key:', fact['date_key'].min(), fact['date_key'].max())
print('Nulls (top 10):')
print(fact.isna().sum().sort_values(ascending=False).head(10))


Shape: (16, 4)
Year min/max: 2000 2015
Min/Max date_key: 20000101 20150101
Nulls (top 10):
notes       16
year         0
hys          0
date_key     0
dtype: int64


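What: Read the validation output: 16 rows covering 2000 to 2015, with no nulls in `year`, `hys`, or `date_key`; `notes` is null in all 16 rows.
Why: Ensure readiness and avoid surprises in visuals. The series stops at 2015, so visuals for 2016 to 2024 will have no credit sentiment values.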
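
The checks printed above rely on visual inspection. They could also be expressed as explicit rules that return a list of problems, so a rerun on refreshed data fails loudly. This is a sketch under assumptions: `validate_fact` is a hypothetical helper, the column names follow the tidied export (`year`, `hys`, `date_key`), and the toy frame is not the real data.

```python
import pandas as pd

def validate_fact(frame: pd.DataFrame, start_year: int, end_year: int) -> list:
    """Return human-readable problems; an empty list means every check passed."""
    problems = []
    if frame["date_key"].duplicated().any():
        problems.append("date_key is not unique")
    if not frame["year"].between(start_year, end_year).all():
        problems.append("year outside analysis window")
    # date_key must equal Jan-01 of the row's year in YYYYMMDD form.
    expected_key = frame["year"].astype("int64") * 10000 + 101
    if not (frame["date_key"] == expected_key).all():
        problems.append("date_key does not match Jan-01 of year")
    if frame["hys"].isna().any():
        problems.append("hys contains nulls")
    return problems

# Toy frame (assumed values) shaped like the exported table.
demo = pd.DataFrame(
    {"year": [2000, 2001], "hys": [0.1, 0.2], "date_key": [20000101, 20010101]}
)
print(validate_fact(demo, 2000, 2024))  # []
```

Returning a list rather than raising on the first failure makes it possible to report every issue in one pass.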


## Decisions & Notes

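- Year column detected by name (`year`) and restricted to the 2000–2024 analysis window. The source runs from 1926 to 2015, so the export covers 2000–2015 (16 rows).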
- `date_key` generated as YYYYMMDD for Jan-01 of each year.
- Numeric coercions applied with `errors='coerce'`. No values were imputed and no columns were dropped; `notes` is null in every row and is retained in the export.
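
To back the coercion note with a number, one approach is to count values that were present before `pd.to_numeric(..., errors='coerce')` and missing afterwards. The helper name `coercion_losses` and the sample values below are assumptions for illustration.

```python
import pandas as pd

def coercion_losses(series: pd.Series) -> int:
    """Count values that were non-null before coercion but NaN after it."""
    coerced = pd.to_numeric(series, errors="coerce")
    return int((series.notna() & coerced.isna()).sum())

# Toy column (assumed values): "n/a" is lost to coercion, None was already missing.
demo = pd.Series(["0.18", "n/a", None, "0.27"])
print(coercion_losses(demo))  # 1
```

A count of zero for each metric column would support the statement that coercion changed types without discarding any data.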
