# TCS Stock Data — Machine Learning Models
**Internship Project | TCS Stock Data – Live and Latest**

This notebook trains and evaluates machine learning models to predict the TCS closing price.

### Models covered:
| # | Model | Type | Library |
|---|---|---|---|
| A | Linear Regression | Baseline | scikit-learn |
| B | Random Forest | Ensemble | scikit-learn |
| C | LSTM | Deep Learning | TensorFlow/Keras |

---

## 0. Setup — Imports and Path Configuration

In [None]:
import os
import sys
import warnings
warnings.filterwarnings('ignore')

# Add project root to sys.path (assumes the notebook is run from a folder one level below the project root)
PROJECT_ROOT = os.path.dirname(os.getcwd())
if PROJECT_ROOT not in sys.path:
    sys.path.insert(0, PROJECT_ROOT)

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import seaborn as sns

from sklearn.linear_model    import LinearRegression
from sklearn.ensemble        import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing   import MinMaxScaler
from sklearn.metrics         import mean_squared_error, r2_score, mean_absolute_error

# Project modules
from src.data_loader import load_tcs_data
from src.features    import build_features
from src.models      import (
    prepare_ml_data,
    run_linear_regression,
    run_random_forest,
    run_lstm,
)

plt.style.use('seaborn-v0_8-darkgrid')
%matplotlib inline

print('✓ Imports complete!')

---
## 1. Load and Prepare Data

In [None]:
# Load raw data
raw_df = load_tcs_data()
print(f'Raw data shape  : {raw_df.shape}')
raw_df.head(3)

In [None]:
# Build feature-enriched DataFrame (adds lags, MAs, calendar features, etc.)
feature_df = build_features(raw_df)
print(f'\nFeature DataFrame shape: {feature_df.shape}')
feature_df.head(3)

In [None]:
# All feature columns available for modeling
print('Features in the enriched DataFrame:')
for i, col in enumerate(feature_df.columns, 1):
    print(f'  {i:2}. {col}')
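
The comment on `build_features` mentions lags, moving averages, and calendar features, but the internals of `src.features` are not shown in this notebook. As a rough illustration only, the sketch below derives similar columns with pandas. The function name `sketch_features`, the columns `Day_Of_Week` and `Month`, and the synthetic prices are all assumptions; `Prev_Close_1` and `MA_7` reuse names that appear later in this notebook, and shifting the moving average by one day is a design choice made here, not necessarily what the project does.

```python
import numpy as np
import pandas as pd


def sketch_features(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for build_features; not the project's implementation."""
    out = df.copy()
    # Lag feature: the previous trading day's close.
    out["Prev_Close_1"] = out["Close"].shift(1)
    # 7-day moving average of *past* closes; the shift keeps today's close
    # out of its own feature (an assumption made for this sketch).
    out["MA_7"] = out["Close"].shift(1).rolling(window=7).mean()
    # Calendar features (hypothetical column names).
    out["Day_Of_Week"] = out["Date"].dt.dayofweek
    out["Month"] = out["Date"].dt.month
    # Rows without a full history contain NaN and are dropped.
    return out.dropna().reset_index(drop=True)


# Synthetic business-day prices, for illustration only.
dates = pd.bdate_range("2024-01-01", periods=20)
demo = pd.DataFrame({"Date": dates, "Close": np.linspace(3500.0, 3690.0, 20)})
features = sketch_features(demo)
print(features.head(3))
```

Because every feature is built from earlier rows only, a model trained on these columns never sees the value it is asked to predict.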

---
## 2. Model A — Linear Regression (Baseline)

**Why start here?**  
Linear Regression is interpretable and fast. It serves as a *baseline* — if our complex models can't beat this, something is wrong.

**How it works:**  
It fits a straight-line (hyperplane) relationship: `Close ≈ w₁·Open + w₂·Prev_Close_1 + ... + b`
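
To make the formula concrete, here is a minimal sketch that fits scikit-learn's `LinearRegression` on synthetic data generated from known weights and checks that the fit recovers them. The data, the weights (0.3, 0.7) and the intercept (5.0) are invented for illustration; `run_linear_regression` may use a different feature set.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Two synthetic features standing in for Open and Prev_Close_1.
open_price = rng.uniform(3400, 3600, size=50)
prev_close = rng.uniform(3400, 3600, size=50)

# Noise-free target built from known weights: Close = 0.3*Open + 0.7*Prev_Close_1 + 5
close = 0.3 * open_price + 0.7 * prev_close + 5.0

X = np.column_stack([open_price, prev_close])
model = LinearRegression().fit(X, close)

print("weights  :", model.coef_)       # one weight per feature column
print("intercept:", model.intercept_)  # the bias term b
```

On real prices the relationship is noisy, so the fitted weights are estimates rather than exact values.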

**Metrics explained:**
- **MSE (Mean Squared Error)** — average squared error (lower = better)
- **RMSE** — same unit as price, easier to interpret (lower = better)
- **MAE** — average absolute error in INR (lower = better)
- **R²** — proportion of variance explained, 1.0 = perfect, 0 = no better than mean
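
The four metrics can be computed by hand on a toy example. The prices below are made up purely to show the arithmetic.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up closing prices in INR, for illustration only.
y_true = np.array([3500.0, 3520.0, 3480.0, 3510.0])
y_pred = np.array([3490.0, 3530.0, 3500.0, 3510.0])

mse = mean_squared_error(y_true, y_pred)   # mean of squared errors (INR squared)
rmse = np.sqrt(mse)                        # back in INR
mae = mean_absolute_error(y_true, y_pred)  # mean of absolute errors (INR)
r2 = r2_score(y_true, y_pred)              # 1 - SS_res / SS_tot

print(f"MSE={mse:.2f}  RMSE={rmse:.2f}  MAE={mae:.2f}  R2={r2:.3f}")
```

Here the errors are 10, -10, -20 and 0, so MSE is 150 and MAE is 10. Note that R² on a test set can be negative when a model predicts worse than the constant mean of the test targets.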

In [None]:
lr_model = run_linear_regression(feature_df)

In [None]:
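# Inspect the model coefficients (what the model learned)
from src.models import LR_FEATURES
available_features = [f for f in LR_FEATURES if f in feature_df.columns]

# Guard: the names only line up with coef_ if the model was fit on
# exactly these columns, in this order.
assert len(available_features) == len(lr_model.coef_), \
    'Feature list does not match the number of model coefficients'

# Note: coefficients of unscaled features are not directly comparable,
# since their magnitude depends on each feature's units.
coef_df = pd.DataFrame({
    'Feature'    : available_features,
    'Coefficient': lr_model.coef_
}).sort_values('Coefficient', key=abs, ascending=False)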

print('Linear Regression Coefficients (sorted by |magnitude|):')
print(coef_df.to_string(index=False))

---
## 3. Model B — Random Forest Regressor

**Why Random Forest?**  
Stock prices have many non-linear interactions. Decision trees can model these, and
Random Forest reduces overfitting by averaging over many trees (bagging ensemble).

**n_estimators = 100** means we build 100 decision trees and average their predictions.
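
The averaging step can be checked directly: for scikit-learn's `RandomForestRegressor`, the forest's prediction is the mean of the predictions of its individual trees. The sketch below uses synthetic data (an assumption, not the TCS features) to show this.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic non-linear regression problem, for illustration only.
X = rng.uniform(0, 10, size=(200, 2))
y = np.sin(X[:, 0]) + 0.1 * X[:, 1] ** 2 + rng.normal(0, 0.1, size=200)

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

X_new = X[:5]
# One row of predictions per fitted tree: shape (100, 5).
per_tree = np.stack([tree.predict(X_new) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)

print(np.allclose(forest.predict(X_new), manual_average))
```

Each tree is trained on a bootstrap sample of the rows, so the trees disagree slightly; averaging them smooths out much of that variance.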

In [None]:
rf_model = run_random_forest(feature_df, n_estimators=100)

---
## 4. Model Comparison — LR vs Random Forest

In [None]:
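# Re-compute metrics on one shared test split for the comparison table
# (assumes both models were fit on the feature columns that prepare_ml_data returns)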
from src.models import LR_FEATURES, TARGET

X_train, X_test, y_train, y_test, dates_test, _ = prepare_ml_data(feature_df)

results = {}
for name, model in [('Linear Regression', lr_model), ('Random Forest', rf_model)]:
    y_pred = model.predict(X_test)
    results[name] = {
        'MAE'  : mean_absolute_error(y_test, y_pred),
        'RMSE' : np.sqrt(mean_squared_error(y_test, y_pred)),
        'R²'   : r2_score(y_test, y_pred),
    }

comparison_df = pd.DataFrame(results).T.round(4)
print('Model Performance Comparison:')
comparison_df

---
## 5. Model C — LSTM (Optional Deep Learning)

**What is LSTM?**  
Long Short-Term Memory is a special type of Recurrent Neural Network (RNN) designed to learn long-range dependencies in sequences. Unlike Linear Regression, it processes price sequences (e.g. the last 60 trading days) to predict the next day's price.

**Architecture:**
```
Input (60 days) → LSTM(64) → Dropout(0.2) → LSTM(32) → Dropout(0.2) → Dense(1) → Predicted Close
```

> ⚠️ **Requires TensorFlow.** If not installed, run: `pip install tensorflow`  
> Training may take a few minutes.
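
Before a network like this can be trained, the price series has to be cut into overlapping windows of `lookback` days, each paired with the following day's value. How `run_lstm` does this is not shown in the notebook; the helper `make_windows` below is a hypothetical NumPy sketch of the usual approach, run on a dummy series.

```python
import numpy as np


def make_windows(series, lookback):
    """Hypothetical helper: build (samples, lookback, 1) inputs and next-step targets."""
    series = np.asarray(series, dtype=float)
    X, y = [], []
    for end in range(lookback, len(series)):
        X.append(series[end - lookback:end])  # the previous `lookback` values
        y.append(series[end])                 # the value to predict
    # Recurrent layers expect 3-D input: (samples, timesteps, features).
    return np.array(X).reshape(-1, lookback, 1), np.array(y)


# Dummy series 0, 1, ..., 99 so the windows are easy to inspect.
prices = np.arange(100, dtype=float)
X, y = make_windows(prices, lookback=60)
print(X.shape, y.shape)
```

With 100 points and a 60-day lookback this yields 40 training samples. In practice the series is usually scaled first (for example with the `MinMaxScaler` imported above), with the scaler fitted on the training portion only.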

In [None]:
# Run LSTM model on the raw Close price series (uses raw_df, not feature_df)
lstm_result = run_lstm(raw_df, lookback=60, epochs=20, batch_size=32)

---
## 6. Conclusions and Observations

| Model | Strengths | Weaknesses |
|---|---|---|
| Linear Regression | Fast, interpretable | Assumes linearity, misses complex patterns |
| Random Forest | Handles non-linearity, robust | Black-box, can overfit without tuning |
| LSTM | Understands sequences over time | Requires more data, slower training, complex to tune |

### Key Takeaways:
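- `Prev_Close_1` (the previous trading day's close) is typically the strongest single predictor, because daily closing prices are highly autocorrelated.
- Moving average features (`MA_7`, `MA_30`) help capture trend direction.
- Random Forest can model non-linear effects that Linear Regression misses, but tree ensembles cannot predict outside the price range seen in training. Check the comparison table in Section 4 rather than assuming it wins.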
- LSTM adds value for truly sequential modelling but requires careful tuning.

### Future Improvements:
- Try **XGBoost** or **LightGBM** for even better tree-based performance.
- Use **ARIMA / SARIMA** for classical time-series forecasting.
- Add **external sentiment data** (news headlines) as features.
- Implement **hyperparameter tuning** with `GridSearchCV` or `Optuna`.
- Deploy as a **Flask/Streamlit web app** for real-time predictions.
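
As a starting point for the hyperparameter tuning item, the sketch below combines `GridSearchCV` with `TimeSeriesSplit`, so that every validation fold lies after its training fold in time. The synthetic data and the small parameter grid are assumptions chosen to keep the example fast; a real search on the TCS features would use a wider grid.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

rng = np.random.default_rng(42)

# Synthetic, time-ordered data standing in for the feature matrix.
X = rng.normal(size=(120, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=120)

# Deliberately tiny grid (assumption) so the search runs in seconds.
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=TimeSeriesSplit(n_splits=3),         # folds respect chronological order
    scoring="neg_root_mean_squared_error",  # higher (closer to 0) is better
)
search.fit(X, y)

print("best params:", search.best_params_)
print("best RMSE  :", -search.best_score_)
```

A plain shuffled k-fold would let the model train on days that come after the ones it is validated on, which tends to make the scores look better than they would be in live use.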