
# AlphaPulse – Final ML Project (Clean & Explained)

**Objective:**  
Predict whether the next trading day's return exceeds 0.5% using Machine Learning,  
and compare multiple models in a clean, explainable way.



## 1. Machine Learning Pipeline

1. Data Collection  
2. Feature Engineering  
3. Target Creation (Future Returns)  
4. Train-Test Split  
5. Scaling  
6. Model Training  
7. Model Comparison  
8. Final Conclusion


## 2. Import Libraries

In [1]:

import numpy as np
import pandas as pd
import yfinance as yf

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, roc_auc_score

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
import warnings
warnings.filterwarnings('ignore')

## 3. Data Ingestion

In [2]:
ticker = 'RELIANCE.NS'  # no trailing space, otherwise the symbol lookup can fail
data = yf.download(ticker, start="2020-01-01", end="2024-01-01")
data.head()

[*********************100%***********************]  1 of 1 completed


Ticker: RELIANCE.NS

| Date | Close | High | Low | Open | Volume |
|---|---|---|---|---|---|
| 2020-01-01 | 675.32428 | 683.152975 | 673.490184 | 679.082058 | 14004468 |
| 2020-01-02 | 686.821228 | 689.348791 | 676.397899 | 676.397899 | 17710316 |
| 2020-01-03 | 687.648804 | 689.661895 | 681.318729 | 685.792252 | 20984698 |
| 2020-01-06 | 671.700623 | 683.510705 | 670.134872 | 679.976657 | 24519177 |
| 2020-01-07 | 682.034485 | 686.463273 | 677.068828 | 679.52926 | 16683622 |


In [3]:
# Work on an explicit copy so the column assignments below do not touch a view of `data`
df = data[['Close']].copy()
df.head()

Ticker: RELIANCE.NS

| Date | Close |
|---|---|
| 2020-01-01 | 675.32428 |
| 2020-01-02 | 686.821228 |
| 2020-01-03 | 687.648804 |
| 2020-01-06 | 671.700623 |
| 2020-01-07 | 682.034485 |


## 4. Feature Engineering

In [4]:

df['return'] = df['Close'].pct_change()
df['future_return'] = df['Close'].shift(-1) / df['Close'] - 1
df.dropna(inplace=True)
df.head()


Ticker: RELIANCE.NS

| Date | Close | return | future_return |
|---|---|---|---|
| 2020-01-02 | 686.821228 | 0.017024 | 0.001205 |
| 2020-01-03 | 687.648804 | 0.001205 | -0.023192 |
| 2020-01-06 | 671.700623 | -0.023192 | 0.015385 |
| 2020-01-07 | 682.034485 | 0.015385 | -0.00751 |
| 2020-01-08 | 676.912354 | -0.00751 | 0.023031 |
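
The alignment of `return` and `future_return` is easier to see on a handful of rows. The sketch below uses made-up prices (an assumption, not the downloaded data) and the same two formulas as the cell above: each row's `future_return` equals the next row's `return`, and `dropna` removes the first and last rows.

```python
import pandas as pd

# Made-up closing prices, for illustration only
toy = pd.DataFrame({"Close": [100.0, 102.0, 101.0, 103.0]})

# Same formulas as in the notebook cell
toy["return"] = toy["Close"].pct_change()
toy["future_return"] = toy["Close"].shift(-1) / toy["Close"] - 1

# The first row has no past return and the last row has no future return,
# so both are dropped
clean = toy.dropna()
print(toy)
print(clean)
```

Because `future_return` looks one row ahead, it can serve only as the label and never as an input feature.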



## 5. Target Creation

Target = 1 if **future return > 0.5%**  
Target = 0 otherwise


In [5]:

df['target'] = (df['future_return'] > 0.005).astype(int)
df['target'].value_counts(normalize=True)


target
0    0.642424
1    0.357576
Name: proportion, dtype: float64
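
With roughly 64% of samples in class 0, accuracy means little unless it is compared with a constant predictor. A minimal sketch of that baseline follows. It assumes a hypothetical helper `majority_baseline_accuracy` and a toy label series rather than the notebook's data.

```python
import pandas as pd


def majority_baseline_accuracy(y: pd.Series) -> float:
    """Accuracy obtained by always predicting the most frequent class in y."""
    return float(y.value_counts(normalize=True).max())


# Toy labels with 7 zeros and 3 ones (illustrative only)
toy_target = pd.Series([0, 0, 1, 0, 1, 0, 0, 1, 0, 0], name="target")
baseline = majority_baseline_accuracy(toy_target)
print(f"Majority-class baseline accuracy: {baseline:.2f}")
```

Applied to the test labels (`ytest` in this notebook), the same helper would give the accuracy a model has to beat.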

## 6. Train-Test Split

In [6]:

x = df[['return']]
y = df['target']

xtrain, xtest, ytrain, ytest = train_test_split(
    x, y, test_size=0.3, shuffle=False
)
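
Passing `shuffle=False` keeps the rows in date order, so the model trains on the earlier part of the series and is tested on the later part. The sketch below checks this on a toy, date-indexed frame. The values are made up, and the names `toy`, `x_tr` and `x_te` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame: 10 business days of made-up returns (illustrative only)
dates = pd.bdate_range("2020-01-01", periods=10)
toy = pd.DataFrame({"return": np.linspace(-0.02, 0.02, 10)}, index=dates)
toy["target"] = (toy["return"] > 0.005).astype(int)

x_tr, x_te, y_tr, y_te = train_test_split(
    toy[["return"]], toy["target"], test_size=0.3, shuffle=False
)

# Every training date precedes every test date
is_chronological = x_tr.index.max() < x_te.index.min()
print(len(x_tr), len(x_te), is_chronological)
```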


## 7. Feature Scaling

In [7]:

scaler = StandardScaler()
xtrain_scaled = scaler.fit_transform(xtrain)
xtest_scaled = scaler.transform(xtest)
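
Fitting the scaler on the training rows only, as above, keeps test-set statistics out of the model. One way to make that harder to get wrong is to wrap the scaler and the estimator in a scikit-learn `Pipeline`. This is a sketch on synthetic data, not what the notebook itself does, and the names `x_toy`, `y_toy` and `pipe` are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
x_toy = rng.normal(0.0, 0.02, size=(200, 1))   # made-up daily returns
y_toy = (rng.random(200) < 0.35).astype(int)   # made-up labels

split = 140  # chronological 70/30 split
pipe = make_pipeline(StandardScaler(), LogisticRegression(class_weight="balanced"))
pipe.fit(x_toy[:split], y_toy[:split])

# The scaler inside the pipeline was fitted on the training rows only
fitted_mean = pipe.named_steps["standardscaler"].mean_[0]
proba = pipe.predict_proba(x_toy[split:])[:, 1]
print(fitted_mean, proba.shape)
```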


## 8. Logistic Regression (Baseline Model)

In [8]:

log_model = LogisticRegression(class_weight='balanced')
log_model.fit(xtrain_scaled, ytrain)

log_proba = log_model.predict_proba(xtest_scaled)[:,1]
log_pred = (log_proba >= 0.5).astype(int)

log_acc = accuracy_score(ytest, log_pred)
log_auc = roc_auc_score(ytest, log_proba)

log_acc, log_auc


(0.48148148148148145, np.float64(0.4549237170596394))

## 9. K-Nearest Neighbors

In [9]:

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(xtrain_scaled, ytrain)

knn_proba = knn.predict_proba(xtest_scaled)[:,1]
knn_pred = (knn_proba >= 0.5).astype(int)

knn_acc = accuracy_score(ytest, knn_pred)
knn_auc = roc_auc_score(ytest, knn_proba)

knn_acc, knn_auc


(0.6498316498316499, np.float64(0.5823108929905046))

## 10. Decision Tree

In [10]:

dt = DecisionTreeClassifier(max_depth=3, random_state=42)
dt.fit(xtrain, ytrain)

dt_proba = dt.predict_proba(xtest)[:,1]
dt_pred = (dt_proba >= 0.5).astype(int)

dt_acc = accuracy_score(ytest, dt_pred)
dt_auc = roc_auc_score(ytest, dt_proba)

dt_acc, dt_auc


(0.6868686868686869, np.float64(0.5164034994132082))

## 11. Random Forest (Hyperparameter Tuned)

In [11]:

rf = RandomForestClassifier(class_weight='balanced', random_state=42)

param_grid = {
    'n_estimators': [50, 100],
    'max_depth': [2, 3, 4]
}

rf_grid = GridSearchCV(rf, param_grid, scoring='roc_auc', cv=3)
rf_grid.fit(xtrain, ytrain)

rf_best = rf_grid.best_estimator_

rf_proba = rf_best.predict_proba(xtest)[:,1]
rf_pred = (rf_proba >= 0.5).astype(int)

rf_acc = accuracy_score(ytest, rf_pred)
rf_auc = roc_auc_score(ytest, rf_proba)

rf_acc, rf_auc


(0.5521885521885522, np.float64(0.5517176997759522))
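
With an integer `cv`, `GridSearchCV` uses stratified K-fold for classifiers, so a fold can train on rows that come after the rows it validates on. For time-ordered data, one possible alternative is `TimeSeriesSplit`, sketched here on synthetic data with the same parameter grid. This is an assumption about how the search could be set up, not the configuration that produced the scores above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

rng = np.random.default_rng(42)
x_toy = rng.normal(0.0, 0.02, size=(300, 1))   # made-up daily returns
y_toy = (rng.random(300) < 0.35).astype(int)   # made-up labels

tscv = TimeSeriesSplit(n_splits=3)

# Each fold trains on earlier rows and validates on later rows
ordered = all(train.max() < val.min() for train, val in tscv.split(x_toy))

grid = GridSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=42),
    {"n_estimators": [50, 100], "max_depth": [2, 3, 4]},
    scoring="roc_auc",
    cv=tscv,
)
grid.fit(x_toy, y_toy)
print(ordered, grid.best_params_)
```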

## 12. Gradient Boosting

In [12]:

gb = GradientBoostingClassifier()
gb.fit(xtrain, ytrain)

gb_proba = gb.predict_proba(xtest)[:,1]
gb_pred = (gb_proba >= 0.5).astype(int)

gb_acc = accuracy_score(ytest, gb_pred)
gb_auc = roc_auc_score(ytest, gb_proba)

gb_acc, gb_auc


(0.6936026936026936, np.float64(0.5538781606742771))

## 13. Support Vector Machine

In [13]:

svm = SVC(probability=True)
svm.fit(xtrain_scaled, ytrain)

svm_proba = svm.predict_proba(xtest_scaled)[:,1]
svm_pred = (svm_proba >= 0.5).astype(int)

svm_acc = accuracy_score(ytest, svm_pred)
svm_auc = roc_auc_score(ytest, svm_proba)

svm_acc, svm_auc


(0.6936026936026936, np.float64(0.5442761122372772))

## 14. Model Comparison

In [14]:

results = pd.DataFrame({
    'Model': ['Logistic', 'KNN', 'Decision Tree', 'Random Forest', 'Gradient Boosting', 'SVM'],
    'Accuracy': [log_acc, knn_acc, dt_acc, rf_acc, gb_acc, svm_acc],
    'ROC_AUC': [log_auc, knn_auc, dt_auc, rf_auc, gb_auc, svm_auc]
})

results


| | Model | Accuracy | ROC_AUC |
|---|---|---|---|
| 0 | Logistic | 0.481481 | 0.454924 |
| 1 | KNN | 0.649832 | 0.582311 |
| 2 | Decision Tree | 0.686869 | 0.516403 |
| 3 | Random Forest | 0.552189 | 0.551718 |
| 4 | Gradient Boosting | 0.693603 | 0.553878 |
| 5 | SVM | 0.693603 | 0.544276 |
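
The two metrics rank the models differently, which is easier to see once the table is sorted. The sketch below rebuilds the table from the scores reported above and ranks it by ROC AUC. The variable names `ranked`, `best_auc_model` and `best_acc_models` are illustrative.

```python
import pandas as pd

# Scores copied from the comparison table above
results = pd.DataFrame({
    "Model": ["Logistic", "KNN", "Decision Tree", "Random Forest",
              "Gradient Boosting", "SVM"],
    "Accuracy": [0.481481, 0.649832, 0.686869, 0.552189, 0.693603, 0.693603],
    "ROC_AUC": [0.454924, 0.582311, 0.516403, 0.551718, 0.553878, 0.544276],
})

# Rank by ROC AUC, which does not depend on the 0.5 decision threshold
ranked = results.sort_values("ROC_AUC", ascending=False).reset_index(drop=True)
best_auc_model = ranked.loc[0, "Model"]

# Models sharing the highest accuracy
best_acc = results["Accuracy"].max()
best_acc_models = results.loc[results["Accuracy"] == best_acc, "Model"].tolist()

print(ranked)
print(best_auc_model, best_acc_models)
```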



## 15. Final Conclusion

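- We framed the task as binary classification: does the **next-day return exceed 0.5%**?
- Six models were trained on a single feature (the current day's return) and evaluated on the chronologically last 30% of the data; only Logistic Regression and Random Forest used `class_weight='balanced'`, so accuracies are not strictly comparable across models
- KNN had the highest ROC AUC (0.582); Gradient Boosting and SVM tied for the highest accuracy (0.694)
- Every ROC AUC lies between 0.45 and 0.59, so the signal is weak; about 64% of the full sample is class 0, so accuracy should be read against a majority-class baseline
- Problem formulation mattered more than model complexity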

**Key takeaway:**  
Clean framing + honest evaluation > unrealistic accuracy.
