# ML Web App using Flask

## Importing the necessary libraries

In [4]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import joblib
import matplotlib.pyplot as plt
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
import os

## Load and Prepare the Dataset

> In this step, I loaded a historical dataset of Amazon (AMZN) stock prices, originally from Yahoo Finance.
- I downloaded this dataset from Kaggle: ***https://www.kaggle.com/datasets/adilshamim8/amazon-stock-price-history***
- This dataset contains daily trading data with columns such as Date, Open, High, Low, Close, and Volume. 
- These are common features in time series financial data and will help us analyze the stock's historical behavior and eventually forecast its future prices. 

In [6]:
# Load dataset
df = pd.read_csv("/workspaces/ginotomasd-mlwebappflask/data/Amazon_stock_data.csv")

# Show first few records to understand the structure
print(df.head())

# Check dataset info to ensure no missing values and understand data types
df.info()

         Date     Close      High       Low      Open      Volume
0  1997-05-15  0.097917  0.125000  0.096354  0.121875  1443120000
1  1997-05-16  0.086458  0.098958  0.085417  0.098438   294000000
2  1997-05-19  0.085417  0.088542  0.081250  0.088021   122136000
3  1997-05-20  0.081771  0.087500  0.081771  0.086458   109344000
4  1997-05-21  0.071354  0.082292  0.068750  0.081771   377064000
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7086 entries, 0 to 7085
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Date    7086 non-null   object 
 1   Close   7086 non-null   float64
 2   High    7086 non-null   float64
 3   Low     7086 non-null   float64
 4   Open    7086 non-null   float64
 5   Volume  7086 non-null   int64  
dtypes: float64(4), int64(1), object(1)
memory usage: 332.3+ KB


> In this step, I loaded the historical stock price dataset for Amazon, which includes 7,086 trading days. Each row corresponds to a single day of trading data, starting from May 15, 1997. The dataset contains the following columns:

- **Date**: The trading day (currently as a string).
- **Open**: The price at which the stock opened.
- **High**: The highest price reached that day.
- **Low**: The lowest price reached that day.
- **Close**: The price at market close.
- **Volume**: The number of shares traded.

> All numeric values are in float format except for Volume (integer), and the Date column will later be converted to datetime format for time series processing. This dataset looks clean, complete, and ready for further exploration.


## Preprocessing the Data

> In this step, I’m preparing the dataset for modeling. Since this is a time series forecasting task, I need to make sure the data is sorted by date and that the `Date` column is properly converted to datetime format.

> I’ll also set it as the index of the DataFrame for easier manipulation.

- To make a simple but effective model, I’ll try to predict the next day’s **Close** price using today's features like `Open`, `High`, `Low`, `Close`, and `Volume`. 

- To do that, I’ll shift the `Close` column up by one row (`shift(-1)`), so that each row’s target variable (`Next_Close`) is the following day’s closing price.
- Then, I’ll drop the `NaN` that the shift leaves in the last row and split the data into train and test sets in chronological order to avoid data leakage.
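
> To make the direction of the shift concrete, here is a minimal sketch on a made-up five-row series (the dates and prices are illustrative assumptions, not Amazon data). It shows that `shift(-1)` pairs each day with the next day's close and that the chronological split keeps every training row before every test row.

```python
import pandas as pd

# Toy closing prices on five consecutive trading days (hypothetical values)
toy = pd.DataFrame(
    {"Close": [10.0, 11.0, 12.0, 13.0, 14.0]},
    index=pd.to_datetime(
        ["2024-01-02", "2024-01-03", "2024-01-04", "2024-01-05", "2024-01-08"]
    ),
)

# shift(-1) moves values up one row: each day now sees the next day's close
toy["Next_Close"] = toy["Close"].shift(-1)

# The last day has no "next day", so its target is NaN and the row is dropped
toy = toy.dropna()

# Chronological 80/20 split: no shuffling, earlier rows train, later rows test
split = int(len(toy) * 0.8)
train, test = toy.iloc[:split], toy.iloc[split:]

print(toy)
```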


In [7]:
# Convert 'Date' to datetime and sort by date
df["Date"] = pd.to_datetime(df["Date"])
df = df.sort_values("Date")

# Set the date as index (optional but useful)
df.set_index("Date", inplace=True)

# Create the target variable: next day's close price
df["Next_Close"] = df["Close"].shift(-1)

# Drop the last row (NaN in target)
df.dropna(inplace=True)

# Define features and target
X = df[["Open", "High", "Low", "Close", "Volume"]]
y = df["Next_Close"]

# Split into train and test sets (80% train, 20% test), keeping time order
split_index = int(len(df) * 0.8)
X_train, X_test = X.iloc[:split_index], X.iloc[split_index:]
y_train, y_test = y.iloc[:split_index], y.iloc[split_index:]

# Confirm the shapes
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((5668, 5), (1417, 5), (5668,), (1417,))

> After preparing the dataset, I ended up with 5 features: `Open`, `High`, `Low`, `Close`, and `Volume`. I used these to predict the next day’s `Close` price, which I obtained by shifting that column up by one row.

> Then, I split the data into training and testing sets, keeping the time order to avoid leakage (no random shuffling). The result is:

- `X_train`: 5,668 rows × 5 features  
- `X_test`: 1,417 rows × 5 features  
- `y_train`: 5,668 target values  
- `y_test`: 1,417 target values

> Now I’m ready to train a regression model to forecast Amazon’s next-day closing stock price.


## Model Development

> To predict Amazon's next-day closing price, I decided to start with a simple Linear Regression model. 
- Linear regression is a great first step because it's easy to interpret and surprisingly effective for time series with strong trends or correlations.

> I’ll fit the model using the training data and evaluate it using common regression metrics on the test set: MAE, MSE, and R². 
- These metrics help me understand how far off my predictions are on average, and how well the model explains the variance in the data.
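
> For reference, the three metrics can be computed by hand. The sketch below uses a made-up toy series (not the Amazon data) and a hypothetical helper named `regression_metrics`. It is only meant to show what each number measures, and it should agree with the `sklearn.metrics` functions imported above.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return (MAE, MSE, R²) for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred

    mae = np.mean(np.abs(errors))   # average absolute miss
    mse = np.mean(errors ** 2)      # average squared miss (punishes big misses)

    ss_res = np.sum(errors ** 2)                      # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total variation around the mean
    r2 = 1.0 - ss_res / ss_tot

    return mae, mse, r2

# Toy values chosen for easy mental arithmetic (hypothetical)
toy_true = [3.0, 5.0, 7.0, 9.0]
toy_pred = [2.5, 5.0, 8.0, 9.5]

toy_mae, toy_mse, toy_r2 = regression_metrics(toy_true, toy_pred)
print("MAE:", toy_mae)   # 0.5
print("MSE:", toy_mse)   # 0.375
print("R²:", toy_r2)
```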


In [9]:
# Create and train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

# Print the evaluation results
print("Mean Absolute Error (MAE):", round(mae, 4))
print("Mean Squared Error (MSE):", round(mse, 4))
print("R-squared (R²):", round(r2, 4))

Mean Absolute Error (MAE): 2.3557
Mean Squared Error (MSE): 10.5792
R-squared (R²): 0.9921


> After training my Linear Regression model on the historical Amazon stock data, I evaluated its performance using the test set. The results look decent, but I don't expect anything out of the ordinary:

- **Mean Absolute Error (MAE): 2.36** – On average, my predictions are off by about $2.36, which is relatively low considering the range of Amazon's stock prices.
- **Mean Squared Error (MSE): 10.58** – Squaring the errors penalizes larger mistakes more heavily. Taking the square root gives an RMSE of about $3.25, in the same units as the price.
- **R-squared (R²): 0.9921** – The model explains 99.21% of the variance in the test-set next-day closing prices. That score is high, but it should be read with care: a stock's closing price is usually very close to the previous day's close, so having today's `Close` as a feature can produce a high R² on its own.

> Overall, I’m satisfied with this initial model and will now move on to try to improve it a little bit more. Just for fun.


## Model Optimization

> In this step, I’m going to try improving the initial Linear Regression model by using **Ridge Regression**, which adds L2 regularization to reduce overfitting and improve generalization. 
- I’ll use `GridSearchCV` to tune the regularization strength parameter `alpha`.

> This approach helps me find the best-performing version of Ridge Regression without manually testing multiple configurations. 
- Once I find the best model, I’ll compare its results to the original Linear Regression.
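
> When tuning on time-ordered data, the cross-validation folds can be kept chronological too. An integer `cv` in `GridSearchCV` uses plain unshuffled K-fold for regressors, so some folds validate on rows that come before part of their training data. The sketch below swaps in `TimeSeriesSplit` instead. It runs on synthetic data (an assumption, not the Amazon dataset), and the names `X_demo`, `y_demo`, and `grid_ts` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Synthetic stand-in for X_train / y_train (not the Amazon data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Each fold trains on an initial segment and validates on the rows right after it
tscv = TimeSeriesSplit(n_splits=5)

grid_ts = GridSearchCV(
    Ridge(),
    param_grid={"alpha": [0.01, 0.1, 1, 10, 100, 1000]},
    cv=tscv,
    scoring="neg_mean_squared_error",
)
grid_ts.fit(X_demo, y_demo)

print("Best alpha (synthetic data):", grid_ts.best_params_["alpha"])
```

> Whether this changes the selected `alpha` on the real data is not something this sketch can tell; it only shows how the splitter plugs into the search.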


In [12]:
# Define the parameter grid for Ridge
param_grid = {'alpha': [0.01, 0.1, 1, 10, 100, 1000]}

# Initialize Ridge model
ridge = Ridge()

# Use GridSearchCV to find best alpha
grid = GridSearchCV(ridge, param_grid, cv=5, scoring='neg_mean_squared_error')
grid.fit(X_train, y_train)

# Use best model to predict
y_pred_ridge = grid.best_estimator_.predict(X_test)

# Evaluate
mae = mean_absolute_error(y_test, y_pred_ridge)
mse = mean_squared_error(y_test, y_pred_ridge)
r2 = r2_score(y_test, y_pred_ridge)

print("Best Alpha:", grid.best_params_['alpha'])
print("Mean Absolute Error (MAE):", round(mae, 4))
print("Mean Squared Error (MSE):", round(mse, 4))
print("R-squared (R²):", round(r2, 4))

Best Alpha: 1
Mean Absolute Error (MAE): 2.3556
Mean Squared Error (MSE): 10.5771
R-squared (R²): 0.9921



> I tuned my model using Ridge Regression and cross-validation through `GridSearchCV`. The best alpha value turned out to be **1**, the regularization strength with the lowest cross-validated MSE among the six values I tested.

> The performance of this Ridge model is almost identical to my original Linear Regression model:

- **MAE:** 2.3556  
- **MSE:** 10.5771  
- **R²:** 0.9921

> This tells me that Ridge only made a tiny difference compared with my initial model, so regularization has little to add here. That does not mean the model is actually good at forecasting (it isn’t).

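> I also got a few warnings (`LinAlgWarning: Ill-conditioned matrix`) during the cross-validation. These are related to numerical instability when solving the Ridge linear system, probably because the features are on very different scales (`Volume` is in the hundreds of millions while the early prices are below $1) and because `Open`, `High`, `Low`, and `Close` are strongly correlated with each other.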
- Since the performance is solid and consistent, I’m not too worried right now, but it’s something to keep in mind if I scale this up later.
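
> A common way to address ill-conditioning caused by mismatched feature scales is to standardize the features before Ridge. Below is a minimal sketch using a `Pipeline` on synthetic data (the price and volume values are made up, and the names `X_demo`, `y_demo`, and `scaled_ridge` are hypothetical). It has not been run on the Amazon dataset, so treat it as an idea to try rather than a confirmed fix.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (not the Amazon data)
rng = np.random.default_rng(42)
n_days = 300
price = 50.0 + np.cumsum(rng.normal(size=n_days))
high = price + np.abs(rng.normal(scale=0.5, size=n_days))
volume = rng.integers(100_000_000, 1_000_000_000, size=n_days).astype(float)

# Features for day t, target is the price on day t + 1
X_demo = np.column_stack([price, high, volume])[:-1]
y_demo = price[1:]

# The scaler is fit only on the data passed to fit(), then Ridge sees scaled columns
scaled_ridge = Pipeline([
    ("scaler", StandardScaler()),
    ("ridge", Ridge(alpha=1.0)),
])
scaled_ridge.fit(X_demo, y_demo)

X_scaled = scaled_ridge.named_steps["scaler"].transform(X_demo)
print("Column std before scaling:", X_demo.std(axis=0))
print("Column std after scaling: ", X_scaled.std(axis=0))
```

> Because the scaler and the model live in one object, a single `joblib.dump` of the pipeline would keep them together for later use.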

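> (I don't believe that this information alone can produce genuinely accurate predictions. The high R² most likely reflects that one day's close is almost always near the previous day's close, not real forecasting skill.)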
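
> One way to test this doubt is to compare the model against a naive baseline that simply predicts "tomorrow's close equals today's close". The sketch below defines a hypothetical helper, `persistence_baseline_scores`, and demonstrates it on a synthetic random walk (an assumption for illustration, not the Amazon data).

```python
import numpy as np

def persistence_baseline_scores(close):
    """Score the naive forecast 'tomorrow's close equals today's close'.

    Returns (MAE, R²) computed over all consecutive pairs in `close`.
    """
    close = np.asarray(close, dtype=float)
    y_true = close[1:]     # actual next-day closes
    y_pred = close[:-1]    # naive forecast: carry today's close forward

    mae = np.mean(np.abs(y_true - y_pred))
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return mae, 1.0 - ss_res / ss_tot

# Synthetic random-walk "prices" (illustration only)
rng = np.random.default_rng(0)
synthetic_close = 100.0 + np.cumsum(rng.normal(scale=1.0, size=2000))

baseline_mae, baseline_r2 = persistence_baseline_scores(synthetic_close)
print("Naive baseline MAE:", round(baseline_mae, 4))
print("Naive baseline R²:", round(baseline_r2, 4))
```

> In this notebook, the equivalent check would be `mean_absolute_error(y_test, X_test["Close"])` and `r2_score(y_test, X_test["Close"])`. If those scores come out close to the model's, the model is adding little beyond repeating today's price.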


## Saving the model

> Now I'll save the model, so I can set up the Flask app and then deploy it to Render.

In [16]:
# Ensure models directory exists
os.makedirs('../models', exist_ok=True)

# Save the tuned Ridge model (no scaler was fitted in this notebook)
joblib.dump(grid.best_estimator_, '../models/ridge_model.pkl')

['../models/ridge_model.pkl']
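
> Before relying on the saved file, a quick round-trip check can confirm that it loads and predicts the same values. The sketch below does this with a small stand-in model in a temporary directory. The data, the file name `demo_model.pkl`, and the variable names are hypothetical, and it does not touch `../models/`.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import Ridge

# Small stand-in model trained on synthetic data (not the Amazon data)
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(50, 5))
y_demo = X_demo.sum(axis=1)
demo_model = Ridge(alpha=1.0).fit(X_demo, y_demo)

# Save to a temporary folder, then load it back as a fresh object
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_model.pkl")
    joblib.dump(demo_model, demo_path)
    reloaded = joblib.load(demo_path)

# The reloaded model should reproduce the original predictions
same_predictions = np.allclose(demo_model.predict(X_demo), reloaded.predict(X_demo))
print("Reloaded model reproduces predictions:", same_predictions)
```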

> Now let's go to app.py to configure the Flask code.