# Run the Regression Model (Interactive Notebook)

This notebook loads `regression_model.py` from the repo, creates features, trains a quick demo RandomForest, and shows inline evaluation and plots.

This notebook belongs in the `statements` folder. Run the cells in order; if dependencies are missing, uncomment and run the optional install cell (cell 2) first.

In [1]:
# 1) Set the working directory to the project's statements folder
#    (this absolute path is machine-specific; edit it for your setup)
import os
path = r"c:\Users\udugr\OneDrive\Desktop\mine\Predictive Analytics - Copy\First semester\Database course\Database & Programming Essentials 4\Assignment4\statements"
os.chdir(path)
print('CWD:', os.getcwd())

CWD: c:\Users\udugr\OneDrive\Desktop\mine\Predictive Analytics - Copy\First semester\Database course\Database & Programming Essentials 4\Assignment4\statements
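
The absolute path in cell 1 only resolves on the machine where it was written. As a minimal sketch of a more portable alternative, the helper below walks upward from a starting directory until it finds a folder named `statements`. The function `find_statements_dir` is hypothetical (it is not part of `regression_model.py`), and the sketch assumes the notebook is launched from somewhere inside the project tree.

```python
from pathlib import Path
from typing import Optional


def find_statements_dir(start: Path, folder_name: str = "statements") -> Optional[Path]:
    """Return the nearest directory called `folder_name` at or above `start`.

    Hypothetical helper: for `start` and each of its parents, it checks
    whether the directory itself is named `folder_name`, then whether it
    contains a child directory with that name.
    """
    start = start.resolve()
    for candidate in (start, *start.parents):
        if candidate.name == folder_name:
            return candidate
        child = candidate / folder_name
        if child.is_dir():
            return child
    return None  # nothing found; the caller should fall back to a manual path
```

In the notebook, the result of `find_statements_dir(Path.cwd())` could be passed to `os.chdir` in place of the hard-coded `path`, after checking that it is not `None`.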


In [2]:
# 2) Optional: install dependencies (run once if needed)
# Uncomment the next line to install from the project's requirements file
# %pip install -r requirements_ui.txt
# or install common packages individually:
# %pip install pandas numpy scikit-learn matplotlib seaborn plotly joblib

In [3]:
# 3) Configure inline plotting for Matplotlib and Plotly
%matplotlib inline
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (10, 6)
import plotly.io as pio
pio.renderers.default = 'notebook'  # optional

In [5]:
# 4) Import the TransactionPredictor class and load data (fast)
from regression_model import TransactionPredictor
predictor = TransactionPredictor()
df = predictor.load_and_prepare_data()  # loads JSON and removes PAYMENT rows
print('\nLoaded dataframe shape:', df.shape)
df.head()

Loading transaction data...
Loaded 67 transactions
After removing payments: 62 transactions

Loaded dataframe shape: (62, 6)


|   | transaction_date | transaction_postdate | transaction_description | spend_category | amount | source_file |
|---|---|---|---|---|---|---|
| 5 | 2020-08-17 | 2020-08-18 | MY SPICE HOUSE WINNIPEG MB | Retail and Grocery | 11.0 | onlineStatement (13).pdf |
| 6 | 2020-08-17 | 2020-08-18 | REAL CDN. SUPERSTORE # WINNIPEG MB | Retail and Grocery | 22.37 | onlineStatement (13).pdf |
| 7 | 2020-08-20 | 2020-08-21 | MPI BISON SERVICE CENTRE WINNIPEG MB | Professional and Financial Services | 25.0 | onlineStatement (13).pdf |
| 8 | 2020-08-20 | 2020-08-24 | SOBEYS #5037 WINNIPEG MB | Retail and Grocery | 15.76 | onlineStatement (13).pdf |
| 9 | 2020-08-22 | 2020-08-24 | TIM HORTONS #8152 WINNIPEG MB | Restaurants | 1.98 | onlineStatement (13).pdf |


In [None]:
# 5) Create features and prepare ML data
predictor.create_features()
X, y = predictor.prepare_ml_data()
print('\nFeatures shape (X):', X.shape)
print('Target shape (y):', y.shape)
# show first 40 feature names
list(X.columns)[:40]

In [None]:
# 6) Quick demo train: train a small RandomForest and show evaluation & plots (fast)
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(n_estimators=50, random_state=42)
rf.fit(predictor.X_train, predictor.y_train)
# attach to predictor for helper methods
predictor.models['Quick RF'] = rf
predictor.best_model = rf
# show feature importances if available
try:
    import pandas as pd
    fi = rf.feature_importances_
    feat_df = pd.DataFrame({'feature': predictor.X_train.columns, 'importance': fi}).sort_values('importance', ascending=False).head(20)
    display(feat_df)
except Exception as e:
    print('Feature importance not available or error:', e)
# evaluation
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
y_pred = rf.predict(predictor.X_test)
print('R2:', r2_score(predictor.y_test, y_pred))
# take the square root manually: the squared=False argument was removed in scikit-learn 1.6
print('RMSE:', mean_squared_error(predictor.y_test, y_pred) ** 0.5)
print('MAE:', mean_absolute_error(predictor.y_test, y_pred))
# inline visualizations using the helper (matplotlib inline enabled)
predictor.create_visualizations()
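
The three scores printed in cell 6 can be reproduced by hand, which is a useful sanity check on a dataset of only about 60 rows. The following is a sketch in plain Python using the standard definitions of R2, RMSE, and MAE. The function name `regression_metrics` and the sample numbers are made up for illustration and are not taken from the transaction data. Returning NaN for a constant target is a simplification; scikit-learn treats that case differently.

```python
import math
from typing import Sequence


def regression_metrics(y_true: Sequence[float], y_pred: Sequence[float]) -> dict:
    """Compute R2, RMSE and MAE from their standard definitions."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and the same length")
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)                 # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    return {
        "r2": 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan"),
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
    }


# Made-up values, purely for illustration
metrics = regression_metrics([10.0, 20.0, 30.0], [12.0, 18.0, 33.0])
print(metrics)
```

Because RMSE squares each error before averaging, a single large miss raises it more than it raises MAE, so a wide gap between the two usually points to a few badly predicted transactions.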

In [None]:
# 7) Optional: run the full pipeline (can be slow because of CV/GridSearch).
# Uncomment if you want the complete analysis including GridSearchCV tuning.
# predictor.run_complete_analysis()
# Alternatively run heavy steps separately:
# model_scores = predictor.train_models()
# cv_scores = predictor.prevent_overfitting()
# model_performance = predictor.select_best_model()
# predictor.analyze_feature_importance()
# predictor.create_visualizations()

## Tips
- To save the selected model for later use: `import joblib; joblib.dump(predictor.best_model, 'best_reg_model.joblib')`
- To run the Streamlit UI, open a terminal, run `streamlit run model_ui.py`, then open `http://localhost:8501` in a browser.
- If plots do not display inline, ensure `%matplotlib inline` is set in a cell (Cell 3).
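
The save tip above can be paired with a reload check. The sketch below trains a small RandomForest on synthetic data (an assumption, since the transaction data is not available here), saves it with `joblib.dump`, loads it back with `joblib.load`, and confirms the predictions match. It writes to a temporary directory; in the notebook, the model would be `predictor.best_model` and the file would be `best_reg_model.joblib` in the working folder.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (assumption): the real notebook would use
# predictor.X_train / predictor.y_train instead.
rng = np.random.default_rng(42)
X = rng.normal(size=(60, 4))
y = X[:, 0] * 3.0 + rng.normal(scale=0.1, size=60)

model = RandomForestRegressor(n_estimators=50, random_state=42).fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "best_reg_model.joblib"
    joblib.dump(model, model_path)      # save the fitted model to disk
    restored = joblib.load(model_path)  # load it fully back into memory

# The restored model should reproduce the original predictions
same = np.allclose(model.predict(X), restored.predict(X))
print("Predictions match after reload:", same)
```

Saved models are best reloaded with the same scikit-learn version that produced them, since compatibility across versions is not guaranteed.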
