# Task 2: Predict Future Stock Prices (Short-Term)

## Objective
Use historical stock data to predict the next day's closing price using machine learning models.

## Dataset
Stock market data from Yahoo Finance (retrieved using the yfinance Python library)

## Problem Statement
Stock price prediction is a common application of regression models in finance. We use historical OHLCV (Open, High, Low, Close, Volume) data to train models that predict the next day's closing price, and we examine how each price feature relates to that target.

---

## Step 1: Import Required Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import yfinance as yf
import warnings
warnings.filterwarnings('ignore')

# Set style for plots
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

print("All libraries imported successfully!")

## Step 2: Download Stock Data from Yahoo Finance

In [None]:
# Select a stock ticker
ticker = "AAPL"  # Apple stock
print(f"Downloading historical data for {ticker}...")

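# Download historical data
data = yf.download(ticker, start="2020-01-01", end="2025-01-01", progress=False)

# Some yfinance versions return MultiIndex columns (field, ticker) even for a
# single ticker; flatten them so that data['Close'] is a Series
if isinstance(data.columns, pd.MultiIndex):
    data.columns = data.columns.get_level_values(0)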

print(f"Downloaded {len(data)} days of data")
print(f"\nData shape: {data.shape}")
print(f"\nFirst few rows:")
data.head()

## Step 3: Data Inspection and Cleaning

In [None]:
# Display data info
print("Data Information:")
data.info()

In [None]:
# Check for missing values
print("Missing Values:")
print(data.isnull().sum())

# Display summary statistics
print("\nSummary Statistics:")
data.describe()

## Step 4: Feature Engineering and Data Preparation

In [None]:
# Create a copy for processing
df = data.copy()

# Create target variable: Next day's closing price
df['Next_Close'] = df['Close'].shift(-1)

# Drop rows with missing values; this removes the last row, which has no next-day close
df = df.dropna()

print(f"Data shape after creating target: {df.shape}")
print(f"\nFirst few rows with target:")
df[['Open', 'High', 'Low', 'Close', 'Volume', 'Next_Close']].head(10)
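
To see what `shift(-1)` does, here is a small sketch on made-up prices (the values and dates are illustrative, not real AAPL data). Each row's `Next_Close` is the following row's `Close`, and the final row has no target, so it is dropped.

```python
import pandas as pd

# Toy closing prices (illustrative values only, not real market data)
toy = pd.DataFrame(
    {"Close": [100.0, 101.5, 99.8, 102.3]},
    index=pd.date_range("2024-01-01", periods=4, freq="D"),
)

# shift(-1) moves each value up one row, so row t holds the close of row t+1
toy["Next_Close"] = toy["Close"].shift(-1)

# The last row has no following day, so its target is NaN and is removed
toy_clean = toy.dropna()

print(toy)
print(f"Rows before: {len(toy)}, rows after dropna: {len(toy_clean)}")
```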

In [None]:
# Select features for the model
feature_columns = ['Open', 'High', 'Low', 'Close', 'Volume']
X = df[feature_columns]
y = df['Next_Close']

print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")
print(f"\nFeatures used: {feature_columns}")

In [None]:
# Split data into training and testing sets (80-20 split)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, shuffle=False
)

print(f"Training set size: {len(X_train)}")
print(f"Testing set size: {len(X_test)}")
print(f"\nTraining data date range: {X_train.index[0]} to {X_train.index[-1]}")
print(f"Testing data date range: {X_test.index[0]} to {X_test.index[-1]}")
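
With `shuffle=False`, `train_test_split` keeps the row order, so the model trains on the earliest 80% of days and is tested on the most recent 20%. The sketch below uses a made-up ordered array to check that this is equivalent to plain slicing. Shuffling would instead let later days leak into the training set.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten ordered "days" standing in for a time series (illustrative data)
days = np.arange(10)

# Chronological split: no shuffling, last 20% held out
train_days, test_days = train_test_split(days, test_size=0.2, shuffle=False)

# Equivalent manual slicing
split_at = len(days) - len(test_days)
same_train = np.array_equal(train_days, days[:split_at])
same_test = np.array_equal(test_days, days[split_at:])

print("Train:", train_days)
print("Test: ", test_days)
print("Matches slicing:", same_train and same_test)
```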

In [None]:
# Normalize the features using StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("Features scaled successfully!")
print(f"Scaled training data shape: {X_train_scaled.shape}")
print(f"Scaled testing data shape: {X_test_scaled.shape}")
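
The scaler is fitted on the training set only, so the test set is transformed with the training mean and standard deviation. A minimal sketch on made-up numbers shows that this matches z = (x - mean_train) / std_train, where `StandardScaler` uses the population standard deviation (`ddof=0`). The names `scaler_demo`, `train`, and `test` are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-feature data (illustrative only)
train = np.array([[1.0], [2.0], [3.0], [4.0]])
test = np.array([[5.0], [6.0]])

scaler_demo = StandardScaler().fit(train)   # statistics come from train only
test_scaled = scaler_demo.transform(test)

# Manual version of the same transform
mean_train = train.mean(axis=0)
std_train = train.std(axis=0)               # ddof=0, as StandardScaler uses
manual = (test - mean_train) / std_train

print(test_scaled.ravel())
print("Matches manual formula:", np.allclose(test_scaled, manual))
```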

## Step 5: Model Training - Linear Regression

In [None]:
# Train Linear Regression Model
print("Training Linear Regression Model...")
lin_reg_model = LinearRegression()
lin_reg_model.fit(X_train_scaled, y_train)

# Make predictions
y_pred_lin = lin_reg_model.predict(X_test_scaled)

print("Linear Regression Model trained successfully!")

In [None]:
# Evaluate Linear Regression
mae_lin = mean_absolute_error(y_test, y_pred_lin)
rmse_lin = np.sqrt(mean_squared_error(y_test, y_pred_lin))
r2_lin = r2_score(y_test, y_pred_lin)

print("\n" + "="*50)
print("LINEAR REGRESSION MODEL EVALUATION")
print("="*50)
print(f"Mean Absolute Error (MAE): ${mae_lin:.2f}")
print(f"Root Mean Squared Error (RMSE): ${rmse_lin:.2f}")
print(f"R² Score: {r2_lin:.4f}")
print("="*50)
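
For reference, the three metrics can be computed directly from their definitions. The sketch below uses made-up numbers; `regression_metrics` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R²) computed directly from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    mae = np.mean(np.abs(errors))
    rmse = np.sqrt(np.mean(errors ** 2))
    ss_res = np.sum(errors ** 2)                      # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return mae, rmse, r2

mae_demo, rmse_demo, r2_demo = regression_metrics(
    [100, 102, 104, 106], [101, 101, 105, 107]
)
print(f"MAE={mae_demo:.2f}, RMSE={rmse_demo:.2f}, R²={r2_demo:.2f}")
# MAE=1.00, RMSE=1.00, R²=0.80
```

R² compares the model against always predicting the test-set mean, so it tends to be high for any model that tracks the price level. MAE and RMSE, which are in dollars, are often easier to interpret here.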

## Step 6: Model Training - Random Forest Regressor

In [None]:
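# Train Random Forest Model
# (tree-based models are insensitive to feature scaling; the scaled features are reused for consistency)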
print("Training Random Forest Regressor Model...")
rf_model = RandomForestRegressor(n_estimators=200, random_state=42, n_jobs=-1)
rf_model.fit(X_train_scaled, y_train)

# Make predictions
y_pred_rf = rf_model.predict(X_test_scaled)

print("Random Forest Model trained successfully!")

In [None]:
# Evaluate Random Forest
mae_rf = mean_absolute_error(y_test, y_pred_rf)
rmse_rf = np.sqrt(mean_squared_error(y_test, y_pred_rf))
r2_rf = r2_score(y_test, y_pred_rf)

print("\n" + "="*50)
print("RANDOM FOREST MODEL EVALUATION")
print("="*50)
print(f"Mean Absolute Error (MAE): ${mae_rf:.2f}")
print(f"Root Mean Squared Error (RMSE): ${rmse_rf:.2f}")
print(f"R² Score: {r2_rf:.4f}")
print("="*50)
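
One property is worth keeping in mind when reading the Random Forest results: a forest predicts averages of training targets, so it cannot output a value above the highest target seen in training. The sketch below illustrates this on a synthetic upward trend (not market data). Whether it affects the AAPL run depends on whether test-period prices exceed the training range.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic upward-trending series (illustrative, not market data)
t = np.arange(200, dtype=float).reshape(-1, 1)
price = 50.0 + 0.5 * t.ravel()

# Chronological split: train on the first 160 points, test on the last 40
t_train, t_test = t[:160], t[160:]
p_train, p_test = price[:160], price[160:]

forest = RandomForestRegressor(n_estimators=50, random_state=0)
forest.fit(t_train, p_train)
forest_pred = forest.predict(t_test)

# Each tree predicts an average of training targets, so the forest
# cannot exceed the largest target seen during training
print(f"Max training target:   {p_train.max():.2f}")
print(f"Max forest prediction: {forest_pred.max():.2f}")
print(f"Max actual test value: {p_test.max():.2f}")
```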

## Step 7: Model Comparison

In [None]:
# Create comparison dataframe
comparison_df = pd.DataFrame({
    'Model': ['Linear Regression', 'Random Forest'],
    'MAE': [mae_lin, mae_rf],
    'RMSE': [rmse_lin, rmse_rf],
    'R² Score': [r2_lin, r2_rf]
})

print("\n" + "="*60)
print("MODEL COMPARISON")
print("="*60)
print(comparison_df.to_string(index=False))
print("="*60)

# Determine best model
best_model = 'Random Forest' if r2_rf > r2_lin else 'Linear Regression'
print(f"\nBest Model (based on R² Score): {best_model}")
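
A useful sanity check when comparing models is a naive persistence baseline, which predicts tomorrow's close as today's close. The helper below is a hypothetical addition, not part of the original notebook, and is demonstrated on made-up prices. In the notebook, the same idea would amount to scoring `X_test['Close']` against `y_test`.

```python
import numpy as np

def persistence_baseline(close):
    """Pair each day's close (the naive prediction) with the next day's close."""
    close = np.asarray(close, dtype=float)
    y_pred = close[:-1]   # today's close used as the forecast
    y_true = close[1:]    # the close that actually followed
    return y_true, y_pred

# Made-up closing prices (illustrative only)
y_true_demo, y_pred_demo = persistence_baseline([10.0, 11.0, 13.0, 12.0])
baseline_mae = np.mean(np.abs(y_true_demo - y_pred_demo))
print(f"Persistence baseline MAE: {baseline_mae:.4f}")
```

A model is only adding value if it beats this baseline on the same test period.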

## Step 8: Visualization - Actual vs Predicted Prices

In [None]:
# Plot Linear Regression predictions
plt.figure(figsize=(14, 6))

# Get test dates
test_dates = df.index[-len(X_test):]

plt.plot(test_dates, y_test.values, label='Actual Close Price', marker='o', markersize=3, linewidth=2)
plt.plot(test_dates, y_pred_lin, label='Linear Regression Prediction', marker='s', markersize=3, linewidth=2, alpha=0.7)
plt.xlabel('Date', fontsize=12)
plt.ylabel('Stock Price ($)', fontsize=12)
plt.title(f'{ticker} - Actual vs Linear Regression Predicted Closing Price', fontsize=14, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

In [None]:
# Plot Random Forest predictions
plt.figure(figsize=(14, 6))

plt.plot(test_dates, y_test.values, label='Actual Close Price', marker='o', markersize=3, linewidth=2)
plt.plot(test_dates, y_pred_rf, label='Random Forest Prediction', marker='s', markersize=3, linewidth=2, alpha=0.7)
plt.xlabel('Date', fontsize=12)
plt.ylabel('Stock Price ($)', fontsize=12)
plt.title(f'{ticker} - Actual vs Random Forest Predicted Closing Price', fontsize=14, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

## Step 9: Scatter Plot - Actual vs Predicted

In [None]:
# Create subplots for both models
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Linear Regression scatter plot
axes[0].scatter(y_test, y_pred_lin, alpha=0.5, s=30)
axes[0].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
axes[0].set_xlabel('Actual Price ($)', fontsize=11)
axes[0].set_ylabel('Predicted Price ($)', fontsize=11)
axes[0].set_title(f'Linear Regression\nR² = {r2_lin:.4f}', fontsize=12, fontweight='bold')
axes[0].grid(True, alpha=0.3)

# Random Forest scatter plot
axes[1].scatter(y_test, y_pred_rf, alpha=0.5, s=30)
axes[1].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
axes[1].set_xlabel('Actual Price ($)', fontsize=11)
axes[1].set_ylabel('Predicted Price ($)', fontsize=11)
axes[1].set_title(f'Random Forest\nR² = {r2_rf:.4f}', fontsize=12, fontweight='bold')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## Step 10: Feature Importance (Random Forest)

In [None]:
# Get feature importance from Random Forest
feature_importance = pd.DataFrame({
    'Feature': feature_columns,
    'Importance': rf_model.feature_importances_
}).sort_values('Importance', ascending=False)

print("\nFeature Importance (Random Forest):")
print(feature_importance.to_string(index=False))

In [None]:
# Plot feature importance
plt.figure(figsize=(10, 6))
plt.barh(feature_importance['Feature'], feature_importance['Importance'], color='steelblue')
plt.xlabel('Importance Score', fontsize=12)
plt.ylabel('Feature', fontsize=12)
plt.title('Feature Importance - Random Forest Model', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3, axis='x')
plt.tight_layout()
plt.show()
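
Impurity-based importances from `feature_importances_` can be hard to interpret when features are strongly correlated, as Open, High, Low, and Close typically are: the forest may credit any one of them for the same signal. Permutation importance on held-out data is a common cross-check. The sketch below uses synthetic data with hypothetical feature names, so it only illustrates the call, not the AAPL result.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)

# Synthetic data: the target depends on "signal" only; "noise" is irrelevant
n = 400
X_demo = rng.normal(size=(n, 2))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=n)
names = ["signal", "noise"]   # hypothetical feature names

X_fit, X_hold = X_demo[:300], X_demo[300:]
y_fit, y_hold = y_demo[:300], y_demo[300:]

model_demo = RandomForestRegressor(n_estimators=50, random_state=0)
model_demo.fit(X_fit, y_fit)

# Shuffle one column at a time on held-out data and measure the score drop
result = permutation_importance(
    model_demo, X_hold, y_hold, n_repeats=5, random_state=0
)
for name, mean_drop in zip(names, result.importances_mean):
    print(f"{name}: {mean_drop:.3f}")
```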

## Step 11: Residual Analysis

In [None]:
# Calculate residuals
residuals_lin = y_test.values - y_pred_lin
residuals_rf = y_test.values - y_pred_rf

# Create residual plots
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Linear Regression residual plot
axes[0, 0].scatter(y_pred_lin, residuals_lin, alpha=0.5)
axes[0, 0].axhline(y=0, color='r', linestyle='--', lw=2)
axes[0, 0].set_xlabel('Predicted Price', fontsize=11)
axes[0, 0].set_ylabel('Residuals', fontsize=11)
axes[0, 0].set_title('Linear Regression - Residual Plot', fontsize=12, fontweight='bold')
axes[0, 0].grid(True, alpha=0.3)

# Random Forest residual plot
axes[0, 1].scatter(y_pred_rf, residuals_rf, alpha=0.5)
axes[0, 1].axhline(y=0, color='r', linestyle='--', lw=2)
axes[0, 1].set_xlabel('Predicted Price', fontsize=11)
axes[0, 1].set_ylabel('Residuals', fontsize=11)
axes[0, 1].set_title('Random Forest - Residual Plot', fontsize=12, fontweight='bold')
axes[0, 1].grid(True, alpha=0.3)

# Linear Regression histogram of residuals
axes[1, 0].hist(residuals_lin, bins=30, edgecolor='black', alpha=0.7)
axes[1, 0].set_xlabel('Residuals', fontsize=11)
axes[1, 0].set_ylabel('Frequency', fontsize=11)
axes[1, 0].set_title('Linear Regression - Residuals Distribution', fontsize=12, fontweight='bold')
axes[1, 0].grid(True, alpha=0.3)

# Random Forest histogram of residuals
axes[1, 1].hist(residuals_rf, bins=30, edgecolor='black', alpha=0.7)
axes[1, 1].set_xlabel('Residuals', fontsize=11)
axes[1, 1].set_ylabel('Frequency', fontsize=11)
axes[1, 1].set_title('Random Forest - Residuals Distribution', fontsize=12, fontweight='bold')
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
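
The plots can be complemented with a few numbers. A mean residual far from zero indicates systematic over- or under-prediction, and a lag-1 autocorrelation far from zero suggests the errors are not independent from one day to the next. `residual_summary` is a hypothetical helper, shown here on made-up residuals; in the notebook it could be applied to `residuals_lin` and `residuals_rf`.

```python
import numpy as np

def residual_summary(residuals):
    """Return mean (bias), standard deviation, and lag-1 autocorrelation."""
    r = np.asarray(residuals, dtype=float)
    bias = r.mean()
    spread = r.std()
    # Correlation between each residual and the one that follows it
    lag1 = np.corrcoef(r[:-1], r[1:])[0, 1]
    return bias, spread, lag1

# Made-up residuals that alternate in sign (illustrative only)
bias_demo, spread_demo, lag1_demo = residual_summary([1.0, -1.0, 1.0, -1.0])
print(f"bias={bias_demo:.2f}, std={spread_demo:.2f}, lag-1 autocorr={lag1_demo:.2f}")
```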

## Step 12: Key Findings and Insights

In [None]:
print("\n" + "="*70)
print("KEY FINDINGS AND INSIGHTS")
print("="*70)

print(f"\n1. DATASET OVERVIEW:")
print(f"   - Stock Ticker: {ticker}")
print(f"   - Total trading days: {len(df)}")
print(f"   - Training samples: {len(X_train)}")
print(f"   - Testing samples: {len(X_test)}")
print(f"   - Date range: {df.index[0].date()} to {df.index[-1].date()}")

print(f"\n2. PRICE STATISTICS:")
print(f"   - Average closing price: ${y.mean():.2f}")
print(f"   - Min price: ${y.min():.2f}")
print(f"   - Max price: ${y.max():.2f}")
print(f"   - Standard deviation: ${y.std():.2f}")

print(f"\n3. MODEL PERFORMANCE COMPARISON:")
print(f"   \n   Linear Regression:")
print(f"   - MAE: ${mae_lin:.2f}")
print(f"   - RMSE: ${rmse_lin:.2f}")
print(f"   - R² Score: {r2_lin:.4f}")

print(f"   \n   Random Forest:")
print(f"   - MAE: ${mae_rf:.2f}")
print(f"   - RMSE: ${rmse_rf:.2f}")
print(f"   - R² Score: {r2_rf:.4f}")

print(f"\n4. FEATURE IMPORTANCE (Random Forest):")
for _, row in feature_importance.iterrows():
    print(f"   - {row['Feature']}: {row['Importance']:.4f}")

print(f"\n5. MODEL SELECTION:")
print(f"   - Best Model: {best_model}")
print(f"   - This model has the higher R² score on the test set.")

print(f"\n6. CONCLUSION:")
print(f"   - Stock price prediction is challenging due to market volatility.")
print(f"   - {best_model} shows better generalization to unseen data.")
print(f"   - The model captures general trends but may miss sudden price spikes.")
print(f"   - Same-day OHLCV features mainly capture the current price level, which is usually close to the next day's close.")
print(f"   - For production use, consider adding external factors (news, sentiment).")

print("\n" + "="*70)

## Summary

In this task, we successfully:
1. ✅ Downloaded historical stock data using the yfinance library
2. ✅ Loaded and inspected the OHLCV dataset
3. ✅ Created target variable (next day's closing price)
4. ✅ Split data into training and testing sets (80-20)
5. ✅ Trained two models: Linear Regression and Random Forest
6. ✅ Evaluated models using MAE, RMSE, and R² metrics
7. ✅ Compared model performance and selected the best one
8. ✅ Visualized actual vs predicted prices
9. ✅ Analyzed feature importance
10. ✅ Performed residual analysis

**Skills Demonstrated:**
- Time series data handling and API integration (yfinance)
- Regression modeling with multiple algorithms
- Data preprocessing and feature scaling
- Model evaluation and comparison
- Data visualization and interpretation
- Feature importance analysis
- Residual analysis and model diagnostics