# Financial Predictive Modeling Project

## Project Summary
The objective of this project is to build a predictive model that estimates key financial metrics based on historical and synthetic data. By leveraging structured data, feature engineering, and regression modeling, the project aims to provide actionable insights for decision-makers. This model is intended to identify trends, support scenario analysis, and improve the accuracy of financial forecasts in a controlled environment.

Accurate predictions matter to stakeholders who rely on timely insights for investment or operational decisions. Even small improvements in predictive accuracy can translate into better risk management, resource allocation, and strategic planning. A robust, reproducible pipeline is intended to give stakeholders results they can audit, trust, and act on consistently.

## Stakeholder Persona / Context
| Stakeholder       | Role / Needs | Interests & Concerns |
|------------------|-------------|--------------------|
| Investment Analyst | Uses model outputs for portfolio strategy | Accuracy, timely delivery, interpretability |
| Risk Manager       | Monitors potential exposure | Model reliability, error tracking, transparency |
| Team Lead / PM     | Oversees project progress | Reproducibility, workflow clarity, deployment readiness |

**Context Description:**  
The primary users of this project are analysts and managers within a financial team who need clear, interpretable, and timely predictions. They care about accuracy, reliability, and the ability to audit and reproduce results. Ensuring a smooth workflow and clear communication of outputs is crucial for their decision-making.

## Python Fundamentals Summary

In [1]:
import pandas as pd
import numpy as np
import sys
from pathlib import Path

project_root = Path().resolve().parent
sys.path.append(str(project_root))
from src.utils import parse_date_column, convert_column_type, fill_missing_values


data = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'age': [25, np.nan, 35, 40],
    'salary': [70000, 80000, np.nan, 90000],
    'join_date': ['2023-01-15', '2022-11-30', '2023-03-20', '2022-03-20']
})

data

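|   | name    | age  | salary  | join_date  |
|---|---------|------|---------|------------|
| 0 | Alice   | 25.0 | 70000.0 | 2023-01-15 |
| 1 | Bob     | NaN  | 80000.0 | 2022-11-30 |
| 2 | Charlie | 35.0 | NaN     | 2023-03-20 |
| 3 | David   | 40.0 | 90000.0 | 2022-03-20 |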


In [2]:
# Convert the age and salary columns to float (age still contains NaN, so a plain int cast would fail)
data = convert_column_type(data, column='age', dtype='float')
data = convert_column_type(data, column='salary', dtype='float')
# Parse join_date with an explicit format; unparseable values become NaT
data['join_date'] = pd.to_datetime(data['join_date'], format='%Y-%m-%d', errors='coerce')
data


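|   | name    | age  | salary  | join_date  |
|---|---------|------|---------|------------|
| 0 | Alice   | 25.0 | 70000.0 | 2023-01-15 |
| 1 | Bob     | NaN  | 80000.0 | 2022-11-30 |
| 2 | Charlie | 35.0 | NaN     | 2023-03-20 |
| 3 | David   | 40.0 | 90000.0 | 2022-03-20 |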


In [3]:
# Fill missing values
data = fill_missing_values(data, strategy='mean')
data

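|   | name    | age       | salary  | join_date  |
|---|---------|-----------|---------|------------|
| 0 | Alice   | 25.0      | 70000.0 | 2023-01-15 |
| 1 | Bob     | 33.333333 | 80000.0 | 2022-11-30 |
| 2 | Charlie | 35.0      | 80000.0 | 2023-03-20 |
| 3 | David   | 40.0      | 90000.0 | 2022-03-20 |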
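
The helpers imported from `src.utils` are not shown in this notebook. The following is a minimal sketch of how `convert_column_type` and `fill_missing_values` might be implemented. It assumes that both return a modified copy and that the `strategy` argument applies only to numeric columns. The function bodies are hypothetical and are not the project's actual code.

```python
import numpy as np
import pandas as pd


def convert_column_type(df: pd.DataFrame, column: str, dtype: str) -> pd.DataFrame:
    """Return a copy of df with `column` cast to `dtype` (hypothetical sketch)."""
    out = df.copy()
    # Coerce unparseable entries to NaN first so bad strings do not raise.
    out[column] = pd.to_numeric(out[column], errors="coerce").astype(dtype)
    return out


def fill_missing_values(df: pd.DataFrame, strategy: str = "mean") -> pd.DataFrame:
    """Fill NaNs in numeric columns with the column mean or median (hypothetical sketch)."""
    if strategy not in {"mean", "median"}:
        raise ValueError(f"Unsupported strategy: {strategy!r}")
    out = df.copy()
    numeric_cols = out.select_dtypes(include="number").columns
    # One fill value per numeric column, computed while ignoring NaN.
    fill_values = getattr(out[numeric_cols], strategy)()
    out[numeric_cols] = out[numeric_cols].fillna(fill_values)
    return out


sample = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "David"],
    "age": [25, np.nan, 35, 40],
    "salary": [70000, 80000, np.nan, 90000],
})
sample = convert_column_type(sample, column="age", dtype="float")
filled = fill_missing_values(sample, strategy="mean")
print(filled)
```

Under these assumptions, Bob's age becomes the mean of the three known ages and Charlie's salary becomes 80000.0, which agrees with the filled table above. A plain `astype("int")` raises while a column still contains NaN, so an integer cast is only safe after the gaps are filled.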


In [4]:
# Basic NumPy operations
ages = data['age'].to_numpy()
mean_age = np.mean(ages)
print("Mean age:", mean_age)

salaries = data['salary'].to_numpy()
total_salary = np.sum(salaries)
print("Total salary:", total_salary)

# Basic pandas operations
# Add a new calculated column
data['salary_in_k'] = data['salary'] / 1000
print("\nDataFrame with new column 'salary_in_k':")
print(data)

# Filter rows
high_salary = data[data['salary'] > 75000]
print("\nFiltered DataFrame (salary > 75k):")
print(high_salary)

Mean age: 33.333333333333336
Total salary: 320000.0

DataFrame with new column 'salary_in_k':
      name        age   salary  join_date  salary_in_k
0    Alice  25.000000  70000.0 2023-01-15         70.0
1      Bob  33.333333  80000.0 2022-11-30         80.0
2  Charlie  35.000000  80000.0 2023-03-20         80.0
3    David  40.000000  90000.0 2022-03-20         90.0

Filtered DataFrame (salary > 75k):
      name        age   salary  join_date  salary_in_k
1      Bob  33.333333  80000.0 2022-11-30         80.0
2  Charlie  35.000000  80000.0 2023-03-20         80.0
3    David  40.000000  90000.0 2022-03-20         90.0


## Data Acquisition and Ingestion

In [5]:
from src.data_ingestion import fetch_sample_data
fetch_sample_data()

Data saved to ../data/raw/ibm_daily_raw.csv
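
The body of `fetch_sample_data` is not shown here. The numbered column names in the saved file (`1. open`, `2. high`, and so on) suggest a JSON response keyed by date, so the sketch below converts a payload of that shape into a DataFrame without any network call. The payload layout, the `"Time Series (Daily)"` key, and the helper name `payload_to_frame` are assumptions for illustration. The two sample rows are copied from the preview of `ibm_daily_raw.csv` in this notebook.

```python
import pandas as pd

# Hypothetical payload: one entry per trading day, values delivered as strings.
payload = {
    "Time Series (Daily)": {
        "2025-08-29": {
            "1. open": "245.230", "2. high": "245.4599", "3. low": "241.7200",
            "4. close": "243.49", "5. volume": "2967558",
        },
        "2025-08-28": {
            "1. open": "245.430", "2. high": "245.8800", "3. low": "243.3600",
            "4. close": "245.73", "5. volume": "2820817",
        },
    }
}


def payload_to_frame(payload: dict, key: str = "Time Series (Daily)") -> pd.DataFrame:
    """Turn a date-keyed dict of string values into a numeric DataFrame."""
    frame = pd.DataFrame.from_dict(payload[key], orient="index")
    # Every field arrives as text, so convert each column to a numeric dtype.
    frame = frame.apply(pd.to_numeric)
    frame.index.name = "date"
    return frame.reset_index()


frame = payload_to_frame(payload)
print(frame)
# Saving would then be a single call, for example:
# frame.to_csv("../data/raw/ibm_daily_raw.csv", index=False)
```

Keeping the parsing step separate from the download makes it possible to test the conversion on a small hand-written payload.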


## Data Storage (Preview)

In [7]:
import os
import pandas as pd
from dotenv import load_dotenv

load_dotenv()
DATA_PATH = os.getenv("DATA_PATH", "../data/raw/")
df = pd.read_csv(os.path.join(DATA_PATH, "ibm_daily_raw.csv"))
print(DATA_PATH)
print(df)

../data/raw/
          date  1. open   2. high    3. low  4. close  5. volume
0   2025-08-29  245.230  245.4599  241.7200    243.49    2967558
1   2025-08-28  245.430  245.8800  243.3600    245.73    2820817
2   2025-08-27  242.870  245.9600  242.0000    244.84    3698372
3   2025-08-26  241.020  244.9800  240.3800    242.63    5386582
4   2025-08-25  242.565  242.5650  239.4300    239.43    3513327
..         ...      ...       ...       ...       ...        ...
95  2025-04-14  239.770  241.7700  236.7300    239.06    3321717
96  2025-04-11  229.720  237.5800  227.5100    235.48    4325895
97  2025-04-10  231.000  232.5700  222.0200    229.55    5656108
98  2025-04-09  217.120  236.3000  215.1636    235.31    7302808
99  2025-04-08  232.560  233.0500  217.2800    221.03    6849996

[100 rows x 6 columns]
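
The preview lists the newest rows first and reads `date` as plain text. A minimal sketch of the same environment-driven load with date parsing and chronological sorting is given below. It writes a three-row stand-in file (values copied from the preview) to a temporary directory so that it runs without the real data. Setting `DATA_PATH` inside the script is an assumption made only for this demonstration.

```python
import os
import tempfile
from pathlib import Path

import pandas as pd

# Stand-in rows copied from the preview; the real file has 100 rows and 6 columns.
raw = pd.DataFrame({
    "date": ["2025-08-29", "2025-08-28", "2025-08-27"],
    "1. open": [245.23, 245.43, 242.87],
    "4. close": [243.49, 245.73, 244.84],
    "5. volume": [2967558, 2820817, 3698372],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    # Assumption: DATA_PATH names the directory that holds the raw CSV.
    os.environ["DATA_PATH"] = tmp_dir
    data_dir = Path(os.getenv("DATA_PATH", "../data/raw/"))
    csv_path = data_dir / "ibm_daily_raw.csv"
    raw.to_csv(csv_path, index=False)

    # parse_dates converts the date strings to datetime64 values on load.
    loaded = pd.read_csv(csv_path, parse_dates=["date"])

# Sort oldest first, the usual order for time-series features.
loaded = loaded.sort_values("date").reset_index(drop=True)
print(loaded.dtypes)
print(loaded)
```

Building the path with `pathlib.Path` avoids depending on whether `DATA_PATH` ends with a trailing slash.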


## Data Preprocessing

In [14]:
import pandas as pd
from src.cleaning import drop_missing_threshold, fill_missing, standardize_column_names

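# Load the raw daily price data
df = pd.read_csv("../data/raw/ibm_daily_raw.csv")

# Clean: apply the missing-value threshold filter, then fill gaps in the opening price with the median
df = drop_missing_threshold(df, threshold=0.4)
df = fill_missing(df, "1. open", method="median")

# Save the cleaned data (without the index column)
df.to_csv("../data/processed/ibm_daily_clean.csv", index=False)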
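
The functions in `src.cleaning` are imported but not shown. The sketch below is one plausible reading of them, under three assumptions: `threshold` is the maximum allowed fraction of missing values per column, `fill_missing` handles a single numeric column, and `standardize_column_names` strips numeric prefixes such as `1. ` before converting names to snake_case. All three bodies are hypothetical, and the demo frame is illustrative data rather than project output.

```python
import re

import numpy as np
import pandas as pd


def standardize_column_names(df: pd.DataFrame) -> pd.DataFrame:
    """Strip prefixes like '1. ' and convert column names to snake_case."""
    def clean(name: str) -> str:
        name = re.sub(r"^\d+\.\s*", "", str(name).strip().lower())
        return re.sub(r"[^0-9a-z]+", "_", name).strip("_")

    return df.rename(columns=clean)


def drop_missing_threshold(df: pd.DataFrame, threshold: float = 0.5) -> pd.DataFrame:
    """Drop columns whose fraction of missing values exceeds `threshold`."""
    missing_fraction = df.isna().mean()
    return df.loc[:, missing_fraction <= threshold]


def fill_missing(df: pd.DataFrame, column: str, method: str = "median") -> pd.DataFrame:
    """Fill NaNs in one numeric column with its median or mean."""
    if method not in {"median", "mean"}:
        raise ValueError(f"Unsupported method: {method!r}")
    out = df.copy()
    out[column] = out[column].fillna(getattr(out[column], method)())
    return out


# Illustrative frame: one gap in '1. open' and one mostly empty column.
demo = pd.DataFrame({
    "date": ["2025-08-29", "2025-08-28", "2025-08-27", "2025-08-26"],
    "1. open": [245.0, np.nan, 243.0, 241.0],
    "mostly_empty": [np.nan, np.nan, np.nan, 1.0],
})
demo = drop_missing_threshold(demo, threshold=0.4)
demo = fill_missing(demo, "1. open", method="median")
demo = standardize_column_names(demo)
print(demo)
```

In this sketch the names are standardized last, so `fill_missing` can still refer to the raw `1. open` label. Standardizing first would require passing `"open"` instead.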