# Stock Price Prediction — Amazon (AMZN)
## Notebook 2 — Feature Engineering & Target Creation

### Objective
The purpose of this notebook is to:
1. Load the cleaned dataset (which already includes S&P 500 data).

2. Create technical indicators (Trend, Momentum, Volatility).

3. Create lag variables to give the model a "memory".

4. Define the Target variable and analyze its distribution.

5. Clean the final dataset (handling expected NaNs) and save it.

### Why this matters
A raw price series on its own gives a machine learning model little to learn from. Moving averages, RSI, and volatility summarize trend, momentum, and risk, which gives the model more informative "clues".

In [38]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os

### Data Loading

In [39]:
df = pd.read_csv("../data/raw/amzn_sp500_clean.csv", index_col=0, parse_dates=True)

print(f"Data loaded: {df.shape[0]} rows and {df.shape[1]} columns.")
df.head()

Data loaded: 5031 rows and 10 columns.


| Date | Adj Close | Close | High | Low | Open | Volume | Price | Return | SP500 | SP500_Return |
|---|---|---|---|---|---|---|---|---|---|---|
| 2005-01-05 | 2.0885 | 2.0885 | 2.138 | 2.078 | 2.0785 | 167084000 | 2.0885 | -0.00878 | 1183.73999 | -0.003628 |
| 2005-01-06 | 2.0525 | 2.0525 | 2.1125 | 2.045 | 2.0905 | 174018000 | 2.0525 | -0.017237 | 1187.890015 | 0.003506 |
| 2005-01-07 | 2.116 | 2.116 | 2.1345 | 2.058 | 2.069 | 196732000 | 2.116 | 0.030938 | 1186.189941 | -0.001431 |
| 2005-01-10 | 2.092 | 2.092 | 2.148 | 2.0855 | 2.097 | 146958000 | 2.092 | -0.011342 | 1190.25 | 0.003423 |
| 2005-01-11 | 2.082 | 2.082 | 2.108 | 2.0505 | 2.07 | 158406000 | 2.082 | -0.00478 | 1182.98999 | -0.0061 |


### 1. Feature Engineering (Technical Indicators)

We will calculate several standard financial metrics:
- **Daily Return & Volatility (20 days)**: To measure market risk. A 20-day window is a common convention that approximates one trading month (about 21 trading days).

- **Moving Averages (MA10 & MA50)**: To smooth out noise and identify short/long-term trends.

- **RSI (Relative Strength Index)**: A momentum indicator (0 to 100) to identify overbought or oversold conditions.

- **Lags (t-1, t-2, t-3)**: Prices from the previous one, two, and three trading days, to help the model detect short-term patterns.

In [47]:
# Returns & Volatility
# Recomputing 'Return' overwrites the column loaded from the CSV; the first row becomes NaN
df['Return'] = df['Adj Close'].pct_change()
df['Volatility'] = df['Return'].rolling(window=20).std()

# Moving Averages (Trend)
df['MA_10'] = df['Adj Close'].rolling(window=10).mean()
df['MA_50'] = df['Adj Close'].rolling(window=50).mean()

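# RSI (Momentum)
def compute_rsi(data, window=14):
    """RSI based on simple rolling means of gains and losses."""
    diff = data.diff()
    # Average gain and average loss over the window (losses as positive numbers)
    gain = (diff.where(diff > 0, 0)).rolling(window=window).mean()
    loss = (-diff.where(diff < 0, 0)).rolling(window=window).mean()
    rs = gain / loss  # relative strength; inf when the window contains no losses
    return 100 - (100 / (1 + rs))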

df['RSI'] = compute_rsi(df['Adj Close'])

# Lags (Past Prices)
for i in range(1, 4):
    df[f'Lag_{i}'] = df['Adj Close'].shift(i)

df.head()

| Date | Adj Close | Close | High | Low | Open | Volume | Price | Return | SP500 | SP500_Return | Volatility | MA_10 | MA_50 | RSI | Lag_1 | Lag_2 | Lag_3 | Target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2005-01-05 | 2.0885 | 2.0885 | 2.138 | 2.078 | 2.0785 | 167084000 | 2.0885 | NaN | 1183.73999 | -0.003628 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0 |
| 2005-01-06 | 2.0525 | 2.0525 | 2.1125 | 2.045 | 2.0905 | 174018000 | 2.0525 | -0.017237 | 1187.890015 | 0.003506 | NaN | NaN | NaN | NaN | 2.0885 | NaN | NaN | 1 |
| 2005-01-07 | 2.116 | 2.116 | 2.1345 | 2.058 | 2.069 | 196732000 | 2.116 | 0.030938 | 1186.189941 | -0.001431 | NaN | NaN | NaN | NaN | 2.0525 | 2.0885 | NaN | 0 |
| 2005-01-10 | 2.092 | 2.092 | 2.148 | 2.0855 | 2.097 | 146958000 | 2.092 | -0.011342 | 1190.25 | 0.003423 | NaN | NaN | NaN | NaN | 2.116 | 2.0525 | 2.0885 | 0 |
| 2005-01-11 | 2.082 | 2.082 | 2.108 | 2.0505 | 2.07 | 158406000 | 2.082 | -0.00478 | 1182.98999 | -0.0061 | NaN | NaN | NaN | NaN | 2.092 | 2.116 | 2.0525 | 1 |
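The `compute_rsi` function in this notebook averages gains and losses with a simple rolling mean. Many charting tools instead use Wilder-style exponential smoothing, which can give somewhat different values. The sketch below is one possible variant, not the method used in this notebook: `compute_rsi_wilder` is a hypothetical helper name, the prices are synthetic, and the `ewm` seeding differs slightly from Wilder's original recipe (which starts from a simple average of the first window).

```python
import numpy as np
import pandas as pd


def compute_rsi_wilder(prices: pd.Series, window: int = 14) -> pd.Series:
    """RSI with Wilder-style exponential smoothing (hypothetical helper)."""
    diff = prices.diff()
    gain = diff.clip(lower=0)    # positive moves, 0 otherwise
    loss = -diff.clip(upper=0)   # negative moves as positive numbers
    # alpha = 1 / window mimics Wilder's recursive average
    avg_gain = gain.ewm(alpha=1 / window, adjust=False, min_periods=window).mean()
    avg_loss = loss.ewm(alpha=1 / window, adjust=False, min_periods=window).mean()
    rs = avg_gain / avg_loss
    return 100 - 100 / (1 + rs)


# Synthetic random-walk prices, for illustration only
rng = np.random.default_rng(0)
prices = pd.Series(100 + rng.normal(0, 1, 200).cumsum())
rsi = compute_rsi_wilder(prices)
print(rsi.describe())
```

Whichever variant is chosen, it should be applied consistently across training and evaluation data, since the two versions are not interchangeable.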


### 2. Analysis of Lags and Missing Values (NaN)

Before moving forward, let's look at what our mathematical transformations did to the dataset:

- **Lags validation**: `Lag_1` should perfectly match the `Adj Close` of the previous day. This gives our model its memory.

- **The NaN behavior at the beginning**: Indicators like `MA_50` require 50 days of historical data to compute the first average, so the first 49 rows naturally contain `NaN` (Not a Number). The same logic applies to `MA_10` (first 9 rows), `Volatility` (20-day window), and `RSI` (14-day window). This initial data loss is expected and mathematically correct.

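- **The NaN behavior at the end**: The target defined in the next section compares each day's price with the next day's price via `shift(-1)`. The very last row has no "tomorrow", so the shifted price is `NaN` there. Note that the comparison `NaN > price` evaluates to `False`, so after `.astype(int)` this row holds `Target = 0` rather than `NaN`, and `dropna()` does not remove it.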
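The checks described in this list can be sketched on a small synthetic series. The `demo` frame and its random-walk prices are assumptions made for illustration; only the column names and window sizes come from this notebook.

```python
import numpy as np
import pandas as pd

# Synthetic prices standing in for the real 'Adj Close' column (illustration only)
rng = np.random.default_rng(42)
dates = pd.bdate_range("2020-01-01", periods=120)
demo = pd.DataFrame({"Adj Close": 100 + rng.normal(0, 1, 120).cumsum()}, index=dates)

demo["Return"] = demo["Adj Close"].pct_change()
demo["Volatility"] = demo["Return"].rolling(window=20).std()
demo["MA_10"] = demo["Adj Close"].rolling(window=10).mean()
demo["MA_50"] = demo["Adj Close"].rolling(window=50).mean()
demo["Lag_1"] = demo["Adj Close"].shift(1)

# Number of missing values per column (all of them sit at the start)
nan_counts = demo.isna().sum()
print(nan_counts)

# Lag_1 must equal the previous row's Adj Close
lag_ok = demo["Lag_1"].iloc[1:].to_numpy() == demo["Adj Close"].iloc[:-1].to_numpy()
print("Lag_1 matches previous close:", lag_ok.all())
```

Running the same `isna().sum()` call on the real `df` is a quick way to confirm that `MA_50` is the column that determines how many rows `dropna()` removes.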

### 3. Target Creation & Data Cleaning

To train our machine learning model, we need to define a **Target variable** (what the model will try to predict). Predicting the exact future price is notoriously difficult in finance due to high noise. Instead, we will frame this as a **Binary Classification** problem:

- **Target = 1**: The stock price will go UP tomorrow.

- **Target = 0**: The stock price will go DOWN or remain flat tomorrow.


We use the `shift(-1)` function on the 'Adj Close' column. This mathematically brings tomorrow's price up to today's row. By doing this, the model can learn the relationship between *today's* technical indicators (RSI, Moving Averages) and *tomorrow's* market direction.

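**Handling Missing Values (NaN):**
Calculating rolling metrics (like the 50-day moving average) naturally creates missing values (NaN) in the first 49 rows. We use `dropna()` to remove these rows, as many machine learning algorithms cannot process missing values. Shifting the price also produces a NaN on the very last day, but the boolean comparison turns it into `False`, so the last row receives `Target = 0` and is kept by `dropna()`. This is consistent with the row counts below, where exactly 49 rows are dropped (5031 to 4982). Because that final label does not reflect a real observation, the last row should be treated with caution.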
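How a comparison against a shifted price behaves on the last row can be checked on a toy series. The sketch below uses made-up prices and contrasts the plain comparison with a NaN-preserving variant; `target_masked` is a hypothetical alternative, not the target built in this notebook.

```python
import pandas as pd

# Tiny synthetic price series (illustration only)
prices = pd.Series(
    [10.0, 11.0, 10.5, 10.5, 12.0],
    index=pd.bdate_range("2024-01-01", periods=5),
    name="Adj Close",
)

next_price = prices.shift(-1)  # last value is NaN: tomorrow is unknown

# Plain comparison: NaN > x evaluates to False, so the last row silently becomes 0
target_naive = (next_price > prices).astype(int)

# NaN-preserving variant: blank out rows where the next price is unknown
target_masked = (next_price > prices).astype(float).where(next_price.notna())

print(pd.DataFrame({"naive": target_naive, "masked": target_masked}))
```

With the masked variant, a later `dropna()` would also remove the final row, at the cost of storing the target as a float until it is cast back to an integer.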

In [41]:
# Create the binary target variable
# Note: the last row compares against NaN, which yields False (Target = 0), not NaN
df['Target'] = (df['Adj Close'].shift(-1) > df['Adj Close']).astype(int)

# Drop rows with NaN values (warm-up period of the rolling indicators)
print("Rows before dropna:", len(df))
df_final = df.dropna().copy()
print("Rows after dropna:", len(df_final))
df_final.head()

Rows before dropna: 5031
Rows after dropna: 4982


| Date | Adj Close | Close | High | Low | Open | Volume | Price | Return | SP500 | SP500_Return | Volatility | MA_10 | MA_50 | RSI | Lag_1 | Lag_2 | Lag_3 | Target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2005-03-17 | 1.6985 | 1.6985 | 1.722 | 1.679 | 1.679 | 97160000 | 1.6985 | 0.002952 | 1190.209961 | 0.001801 | 0.010222 | 1.7484 | 1.90458 | 36.855663 | 1.6935 | 1.709 | 1.73 | 1 |
| 2005-03-18 | 1.708 | 1.708 | 1.714 | 1.6825 | 1.7075 | 105894000 | 1.708 | 0.005593 | 1189.650024 | -0.00047 | 0.010355 | 1.73995 | 1.89697 | 36.855663 | 1.6985 | 1.6935 | 1.709 | 0 |
| 2005-03-21 | 1.6835 | 1.6835 | 1.7315 | 1.677 | 1.7105 | 121688000 | 1.6835 | -0.014344 | 1183.780029 | -0.004934 | 0.010547 | 1.7267 | 1.88959 | 29.326902 | 1.708 | 1.6985 | 1.6935 | 0 |
| 2005-03-22 | 1.6575 | 1.6575 | 1.697 | 1.6535 | 1.685 | 109814000 | 1.6575 | -0.015444 | 1171.709961 | -0.010196 | 0.010459 | 1.7133 | 1.88042 | 24.288826 | 1.6835 | 1.708 | 1.6985 | 1 |
| 2005-03-23 | 1.659 | 1.659 | 1.692 | 1.641 | 1.6545 | 126474000 | 1.659 | 0.000905 | 1172.530029 | 0.0007 | 0.009906 | 1.7022 | 1.87176 | 22.247159 | 1.6575 | 1.6835 | 1.708 | 0 |


### 4. Class Balance Analysis

Before saving our dataset, we must check the distribution of our Target variable. 

If a dataset is heavily imbalanced (for example, 90% of days are "Up" days), a naive machine learning model could achieve 90% accuracy simply by guessing "Up" every single time, without learning anything from our technical features. Let's verify that our classes are relatively balanced, so that accuracy remains a meaningful metric.

In [42]:
print("\nDistribution of the binary target:")
distribution = df_final['Target'].value_counts(normalize=True).round(3) * 100
print(f"- Up days (1): {distribution[1]}%")
print(f"- Down days (0): {distribution[0]}%")


Distribution of the binary target:
- Up days (1): 51.7%
- Down days (0): 48.3%


**Conclusion:** The dataset is well-balanced (roughly 50/50).  
Always predicting "Up" would reach only about 51.7% accuracy here, so any result meaningfully above that baseline has to come from predictive signal in our features rather than from a biased target distribution.
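The majority-class baseline is a useful reference number for later notebooks. The sketch below is a minimal illustration: `majority_baseline_accuracy` is a hypothetical helper, and `demo_target` is a synthetic series built to mirror the 51.7% / 48.3% split reported above.

```python
import pandas as pd


def majority_baseline_accuracy(target: pd.Series) -> float:
    """Accuracy obtained by always predicting the most frequent class."""
    return float(target.value_counts(normalize=True).max())


# Synthetic target: 517 up days and 483 down days (illustration only)
demo_target = pd.Series([1] * 517 + [0] * 483)
baseline = majority_baseline_accuracy(demo_target)
print(f"Majority-class baseline accuracy: {baseline:.3f}")
```

On the real data, the same helper could be called as `majority_baseline_accuracy(df_final['Target'])`, and a trained model's accuracy compared against that figure.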

### 5. Exporting the Processed Dataset

Our feature engineering pipeline is complete. The dataset is enriched with technical indicators, lagged features, and a clean target variable. We will export this final dataframe to the `processed` folder. This file will be the direct input for our predictive models in **Notebook 3**.

In [43]:
os.makedirs("../data/processed", exist_ok=True)


output_path = "../data/processed/amzn_features.csv"
df_final.to_csv(output_path)
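Before moving on to the next notebook, it can be worth confirming that the exported file reloads cleanly. The sketch below shows one way to do this on a small stand-in frame written to a temporary directory; the `demo` frame and the file name `features_demo.csv` are assumptions for illustration, not the real export.

```python
import os
import tempfile

import pandas as pd

# Small stand-in for df_final (illustration only)
demo = pd.DataFrame(
    {"Adj Close": [1.6985, 1.708, 1.6835], "Target": [1, 0, 0]},
    index=pd.DatetimeIndex(["2005-03-17", "2005-03-18", "2005-03-21"], name="Date"),
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "features_demo.csv")
    demo.to_csv(path)
    # Same loading options as the ones used at the top of this notebook
    reloaded = pd.read_csv(path, index_col=0, parse_dates=True)

same_shape = reloaded.shape == demo.shape
has_datetime_index = isinstance(reloaded.index, pd.DatetimeIndex)
no_missing = not reloaded.isna().any().any()
print(same_shape, has_datetime_index, no_missing)
```

The same three checks can be run against `output_path` once the real file has been written.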