# 📈 Stock Price Prediction & Sentiment Analysis - Complete Workflow

This notebook walks through an end-to-end workflow for predicting stock price direction (up or down) with machine learning, plus sentiment analysis of financial text.

## 🎯 What You'll Learn:
- 📊 Data collection from financial APIs (yfinance)
- 🔧 Feature engineering with technical indicators
- 🤖 Training multiple ML models (Random Forest, Gradient Boosting, XGBoost)
- 📈 Making price predictions
- 💬 Sentiment analysis on financial text
- 📉 Visualizing results with interactive charts

**Author:** Stock Prediction System  
**Date:** November 2025  
**Version:** 1.0

---

## 📦 1. Setup and Imports

First, let's import all necessary libraries.

In [None]:
# Standard libraries
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

# Visualization
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots

# Machine Learning
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import xgboost as xgb

# Custom modules
from data_collector import StockDataCollector
from sentiment_analyzer import SentimentAnalyzer
from ml_models import StockPricePredictor, ModelComparison

print("✅ All libraries imported successfully!")

## 📊 2. Data Collection

Let's collect historical market data using yfinance. We'll use Bitcoin (BTC-USD) as an example.

In [None]:
# Initialize data collector
collector = StockDataCollector()

# Configuration
SYMBOL = "BTC-USD"  # Any Yahoo Finance ticker can be used here: AAPL, TSLA, ETH-USD, etc.
INTERVAL = "1h"     # 1h for hourly, 1d for daily
PERIOD = "3mo"      # 7d, 1mo, 3mo, 6mo, 1y

print(f"📥 Collecting data for {SYMBOL}...")
print(f"   Interval: {INTERVAL}")
print(f"   Period: {PERIOD}")
print()

# Collect data
df = collector.get_stock_data(SYMBOL, interval=INTERVAL, period=PERIOD)

if df is not None and not df.empty:
    print(f"✅ Successfully collected {len(df)} data points")
    print(f"   Date range: {df.index[0]} to {df.index[-1]}")
    print()
    print("First few rows:")
    display(df.head())
else:
    print("❌ Failed to collect data")

## 🔧 3. Feature Engineering

Now let's add technical indicators to our data.
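
The internals of `add_technical_indicators` are not shown in this notebook. As a rough sketch only, here is how two of the indicators plotted later (`sma_7`, `sma_25`, `rsi`) could be computed with pandas. The helper name `add_basic_indicators`, the 14-bar RSI window, and the simple-average RSI variant are all assumptions, not the collector's actual implementation.

```python
import pandas as pd

def add_basic_indicators(prices: pd.DataFrame, rsi_window: int = 14) -> pd.DataFrame:
    """Return a copy of `prices` with SMA and RSI columns added (illustrative only)."""
    out = prices.copy()
    # Simple moving averages over the last 7 and 25 bars.
    out["sma_7"] = out["close"].rolling(window=7).mean()
    out["sma_25"] = out["close"].rolling(window=25).mean()

    # RSI (simple-average variant): compare average gains with average losses
    # over the window and map the ratio onto a 0-100 scale.
    delta = out["close"].diff()
    avg_gain = delta.clip(lower=0).rolling(window=rsi_window).mean()
    avg_loss = (-delta.clip(upper=0)).rolling(window=rsi_window).mean()
    out["rsi"] = 100 - 100 / (1 + avg_gain / avg_loss)
    return out

# Synthetic closing prices, just to exercise the function.
demo = pd.DataFrame({"close": [float(100 + (i * 7) % 11) for i in range(40)]})
demo_features = add_basic_indicators(demo)
print(demo_features.tail())
```

The first rows of each indicator are `NaN` until a full window of data is available, which is why such rows are usually dropped before training.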

In [None]:
# Add technical indicators
print("🔧 Engineering features...")
df_features = collector.add_technical_indicators(df)

print(f"✅ Added {len(df_features.columns) - len(df.columns)} new features")
print()
print("Available features:")
print(df_features.columns.tolist())
print()
print("Data with features:")
display(df_features.tail())

## 📈 4. Data Visualization

Let's visualize the stock price and technical indicators.

In [None]:
# Create candlestick chart with indicators
fig = make_subplots(
    rows=3, cols=1,
    shared_xaxes=True,
    vertical_spacing=0.05,
    subplot_titles=('Price & Moving Averages', 'RSI', 'Volume'),
    row_heights=[0.5, 0.25, 0.25]
)

# Candlestick
fig.add_trace(
    go.Candlestick(
        x=df_features.index,
        open=df_features['open'],
        high=df_features['high'],
        low=df_features['low'],
        close=df_features['close'],
        name='Price'
    ),
    row=1, col=1
)

# Moving averages
fig.add_trace(
    go.Scatter(x=df_features.index, y=df_features['sma_7'], name='SMA 7', line=dict(color='orange')),
    row=1, col=1
)
fig.add_trace(
    go.Scatter(x=df_features.index, y=df_features['sma_25'], name='SMA 25', line=dict(color='blue')),
    row=1, col=1
)

# RSI
fig.add_trace(
    go.Scatter(x=df_features.index, y=df_features['rsi'], name='RSI', line=dict(color='purple')),
    row=2, col=1
)
fig.add_hline(y=70, line_dash="dash", line_color="red", row=2, col=1)
fig.add_hline(y=30, line_dash="dash", line_color="green", row=2, col=1)

# Volume
fig.add_trace(
    go.Bar(x=df_features.index, y=df_features['volume'], name='Volume', marker_color='lightblue'),
    row=3, col=1
)

fig.update_layout(
    title=f'{SYMBOL} - Technical Analysis',
    height=800,
    xaxis_rangeslider_visible=False,
    showlegend=True
)

fig.show()

## 🤖 5. Model Training

Now let's train multiple machine learning models to predict price movements.
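
`prepare_data` is a custom method whose internals are not shown here. One common way to frame "price movement" as a classification target, offered purely as an assumption about what such a method might do, is to label each bar 1 if the next bar closes higher and 0 otherwise. The helper name `make_direction_dataset` is hypothetical.

```python
import pandas as pd

def make_direction_dataset(features: pd.DataFrame, price_col: str = "close"):
    """Build (X, y) where y is 1 if the next bar closes higher than the current one."""
    data = features.dropna().copy()
    next_close = data[price_col].shift(-1)
    data["target"] = (next_close > data[price_col]).astype(int)
    # The final row has no "next" bar, so its label is unknown; drop it.
    data = data.iloc[:-1]
    X = data.drop(columns=["target"])
    y = data["target"]
    return X, y

demo = pd.DataFrame(
    {"close": [10.0, 11.0, 10.5, 10.7, 10.6, 10.9, 11.2, 11.1, 11.3, 11.0]}
)
X_demo, y_demo = make_direction_dataset(demo)
print(y_demo.tolist())
```

Because the label looks one bar ahead, every feature in `X` must be computable from data available at or before the current bar; otherwise the model sees the future.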

In [None]:
# Initialize predictor
predictor = StockPricePredictor()

# Prepare data
print("📊 Preparing training data...")
X, y, feature_names = predictor.prepare_data(df_features)

if X is not None:
    print("✅ Data prepared successfully")
    print(f"   Features: {X.shape[1]}")
    print(f"   Samples: {X.shape[0]}")
    print(f"   Target distribution: {dict(pd.Series(y).value_counts())}")
    print()
    
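    # Split chronologically: shuffling time-series rows would let the models
    # train on bars that come after the ones they are tested on
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, shuffle=False
    )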
    
    print(f"   Training samples: {len(X_train)}")
    print(f"   Testing samples: {len(X_test)}")
else:
    print("❌ Failed to prepare data")

In [None]:
# Train Random Forest
print("\n" + "="*60)
print("Training Random Forest...")
print("="*60)

rf_model = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
rf_model.fit(X_train, y_train)

rf_train_acc = accuracy_score(y_train, rf_model.predict(X_train))
rf_test_acc = accuracy_score(y_test, rf_model.predict(X_test))

print(f"Train Accuracy: {rf_train_acc:.4f}")
print(f"Test Accuracy:  {rf_test_acc:.4f}")
print()
print("Classification Report:")
print(classification_report(y_test, rf_model.predict(X_test)))

In [None]:
# Train Gradient Boosting
print("\n" + "="*60)
print("Training Gradient Boosting...")
print("="*60)

gb_model = GradientBoostingClassifier(n_estimators=100, random_state=42)
gb_model.fit(X_train, y_train)

gb_train_acc = accuracy_score(y_train, gb_model.predict(X_train))
gb_test_acc = accuracy_score(y_test, gb_model.predict(X_test))

print(f"Train Accuracy: {gb_train_acc:.4f}")
print(f"Test Accuracy:  {gb_test_acc:.4f}")
print()
print("Classification Report:")
print(classification_report(y_test, gb_model.predict(X_test)))

In [None]:
# Train XGBoost
print("\n" + "="*60)
print("Training XGBoost...")
print("="*60)

xgb_model = xgb.XGBClassifier(n_estimators=100, random_state=42, n_jobs=-1)
xgb_model.fit(X_train, y_train)

xgb_train_acc = accuracy_score(y_train, xgb_model.predict(X_train))
xgb_test_acc = accuracy_score(y_test, xgb_model.predict(X_test))

print(f"Train Accuracy: {xgb_train_acc:.4f}")
print(f"Test Accuracy:  {xgb_test_acc:.4f}")
print()
print("Classification Report:")
print(classification_report(y_test, xgb_model.predict(X_test)))

## 📊 6. Model Comparison

Let's compare the performance of all three models.
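
Accuracy numbers are easier to judge against a naive reference point. The sketch below scores the strategy of always predicting the most frequent training label; the helper name `majority_baseline_accuracy` and the toy labels are illustrative assumptions.

```python
from collections import Counter

def majority_baseline_accuracy(y_train, y_test) -> float:
    """Accuracy of always predicting the most frequent training label."""
    most_common_label, _ = Counter(y_train).most_common(1)[0]
    hits = sum(1 for label in y_test if label == most_common_label)
    return hits / len(y_test)

# Toy labels: 1 is the majority class in training, and 3 of 4 test labels are 1.
baseline = majority_baseline_accuracy([1, 1, 0, 1, 0, 1], [1, 0, 1, 1])
print(f"Majority-class baseline accuracy: {baseline:.4f}")
```

In this notebook it could be called as `majority_baseline_accuracy(y_train, y_test)`. A model whose test accuracy does not clearly beat this number has learned little about price direction.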

In [None]:
# Create comparison dataframe
comparison_df = pd.DataFrame({
    'Model': ['Random Forest', 'Gradient Boosting', 'XGBoost'],
    'Train Accuracy': [rf_train_acc, gb_train_acc, xgb_train_acc],
    'Test Accuracy': [rf_test_acc, gb_test_acc, xgb_test_acc]
})

comparison_df = comparison_df.sort_values('Test Accuracy', ascending=False)

print("="*60)
print("MODEL COMPARISON")
print("="*60)
display(comparison_df)

# Visualize comparison
fig = px.bar(
    comparison_df,
    x='Model',
    y=['Train Accuracy', 'Test Accuracy'],
    barmode='group',
    title='Model Performance Comparison',
    labels={'value': 'Accuracy', 'variable': 'Dataset'}
)
fig.show()

best_model_name = comparison_df.iloc[0]['Model']
best_accuracy = comparison_df.iloc[0]['Test Accuracy']
print(f"\n🏆 Best Model: {best_model_name} with {best_accuracy:.4f} accuracy")

## 🔍 7. Feature Importance

Let's analyze which features are most important for predictions.
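
The `feature_importances_` attribute used in the next cell is derived from the fitted trees. A complementary check is permutation importance: shuffle one feature at a time and measure how much accuracy drops. scikit-learn ships this as `sklearn.inspection.permutation_importance`; the NumPy-only version below is a simplified sketch with a hypothetical helper name, meant to show the idea rather than replace the library function.

```python
import numpy as np

def permutation_importance_scores(predict, X, y, n_repeats=5, seed=0):
    """Mean drop in accuracy when each column of X is shuffled in turn."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    base_acc = np.mean(predict(X) == y)
    scores = np.zeros(X.shape[1])
    for col in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            # Break the link between this feature and the target.
            X_perm[:, col] = rng.permutation(X_perm[:, col])
            drops.append(base_acc - np.mean(predict(X_perm) == y))
        scores[col] = np.mean(drops)
    return scores

def toy_predict(X):
    """Stand-in for a fitted model: looks only at column 0."""
    return (X[:, 0] > 0).astype(int)

# Toy check: the label depends only on column 0, so column 1 should score 0.
toy_rng = np.random.default_rng(42)
X_toy = toy_rng.normal(size=(200, 2))
y_toy = (X_toy[:, 0] > 0).astype(int)
toy_scores = permutation_importance_scores(toy_predict, X_toy, y_toy)
print(toy_scores)
```

With the models trained above, a call would look like `permutation_importance_scores(best_model.predict, X_test, y_test)`, evaluated on held-out data rather than the training set.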

In [None]:
# Get feature importance from best model
if best_model_name == 'Random Forest':
    best_model = rf_model
elif best_model_name == 'Gradient Boosting':
    best_model = gb_model
else:
    best_model = xgb_model

# Create feature importance dataframe
feature_importance = pd.DataFrame({
    'Feature': feature_names,
    'Importance': best_model.feature_importances_
}).sort_values('Importance', ascending=False)

print("Top 10 Most Important Features:")
display(feature_importance.head(10))

# Visualize
fig = px.bar(
    feature_importance.head(15),
    x='Importance',
    y='Feature',
    orientation='h',
    title=f'Top 15 Feature Importance - {best_model_name}',
    labels={'Importance': 'Importance Score'}
)
fig.update_layout(height=600)
fig.show()

## 🎯 8. Making Predictions

Let's use our trained model to make predictions on recent data.
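
Class probabilities from `predict_proba` are ordered like the model's `classes_` attribute, and a probability near 50% carries little information. The sketch below pairs each class with its probability and flags weak calls; the helper name `describe_prediction` and the 0.6 cutoff are arbitrary assumptions, and label 1 is taken to mean UP as in the cell that follows.

```python
def describe_prediction(classes, probabilities, min_confidence=0.6):
    """Return (direction, confidence), marking low-confidence calls as UNCERTAIN."""
    by_class = dict(zip(classes, probabilities))
    label = max(by_class, key=by_class.get)
    confidence = by_class[label]
    direction = "UP" if label == 1 else "DOWN"
    if confidence < min_confidence:
        direction = "UNCERTAIN"
    return direction, confidence

print(describe_prediction([0, 1], [0.35, 0.65]))
print(describe_prediction([0, 1], [0.48, 0.52]))
```

With a fitted model this would be called as `describe_prediction(best_model.classes_, best_model.predict_proba(latest_data)[0])`.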

In [None]:
# Get latest data point
latest_data = X[-1:]
latest_features = pd.DataFrame(latest_data, columns=feature_names)

print("Latest data point features:")
display(latest_features.T)

# Make prediction
prediction = best_model.predict(latest_data)[0]
probabilities = best_model.predict_proba(latest_data)[0]

print("\n" + "="*60)
print("PREDICTION RESULT")
print("="*60)
print(f"Prediction: {'📈 UP (model expects the price to rise)' if prediction == 1 else '📉 DOWN (model expects the price to fall)'}")
print(f"Confidence: {max(probabilities):.2%}")
print()
print(f"Probability of DOWN: {probabilities[0]:.2%}")
print(f"Probability of UP:   {probabilities[1]:.2%}")

## 💬 9. Sentiment Analysis

Now let's analyze sentiment from financial news or social media text.
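
`SentimentAnalyzer` is a custom class whose model is not shown in this notebook. The cell that follows only relies on `analyze_text` returning a dict with `sentiment`, `confidence`, and `scores` keys. To make that shape concrete, here is a toy word-list scorer; the function name, the word lists, and the scoring rule are all hypothetical and far cruder than a real analyzer.

```python
import re

# Tiny hand-picked word lists, for illustration only.
POSITIVE_WORDS = {"surges", "high", "record", "beating", "strong", "recovery", "positive"}
NEGATIVE_WORDS = {"crashes", "fears", "recession", "inflation", "delays", "issues"}

def toy_sentiment(text: str) -> dict:
    """Classify text by counting matches against the two word lists."""
    tokens = re.findall(r"[a-z]+", text.lower())
    pos = sum(token in POSITIVE_WORDS for token in tokens)
    neg = sum(token in NEGATIVE_WORDS for token in tokens)
    if pos > neg:
        sentiment = "positive"
    elif neg > pos:
        sentiment = "negative"
    else:
        sentiment = "neutral"
    matched = pos + neg
    # Confidence is the margin between the two counts, 0.0 when nothing matched.
    confidence = abs(pos - neg) / matched if matched else 0.0
    return {
        "sentiment": sentiment,
        "confidence": confidence,
        "scores": {"positive": pos, "negative": neg},
    }

print(toy_sentiment("Bitcoin surges to new all-time high as institutional investors pile in!"))
```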

In [None]:
# Initialize sentiment analyzer
sentiment_analyzer = SentimentAnalyzer()

# Example texts
texts = [
    "Bitcoin surges to new all-time high as institutional investors pile in!",
    "Stock market crashes amid fears of recession and rising inflation.",
    "Apple announces record-breaking quarterly earnings, beating expectations.",
    "Tesla faces production delays and supply chain issues.",
    "Cryptocurrency market shows strong recovery with positive momentum."
]

print("="*60)
print("SENTIMENT ANALYSIS RESULTS")
print("="*60)
print()

results = []
for i, text in enumerate(texts, 1):
    result = sentiment_analyzer.analyze_text(text)
    results.append(result)
    
    print(f"{i}. Text: {text[:60]}...")
    print(f"   Sentiment: {result['sentiment']}")
    print(f"   Confidence: {result['confidence']:.2%}")
    print(f"   Scores: {result['scores']}")
    print()

# Create sentiment distribution
sentiment_df = pd.DataFrame(results)
sentiment_counts = sentiment_df['sentiment'].value_counts()

fig = px.pie(
    values=sentiment_counts.values,
    names=sentiment_counts.index,
    color=sentiment_counts.index,  # color_discrete_map is keyed on the values passed as `color`
    title='Sentiment Distribution',
    color_discrete_map={'positive': 'green', 'negative': 'red', 'neutral': 'gray'}
)
fig.show()

## 📉 10. Model Evaluation - Confusion Matrix

Let's visualize the confusion matrix to understand model performance better.
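
Precision, recall, and F1 all follow from the four cells of a binary confusion matrix. The sketch below computes them from raw counts and returns 0.0 instead of raising when a denominator is zero (for example, when the model never predicts UP). The helper name `binary_metrics` and the example counts are made up for illustration.

```python
def binary_metrics(tn: int, fp: int, fn: int, tp: int) -> dict:
    """Derive standard metrics from binary confusion-matrix counts."""
    def safe_div(num, den):
        # Avoid ZeroDivisionError when a class is never predicted or never present.
        return num / den if den else 0.0

    return {
        "accuracy": safe_div(tp + tn, tp + tn + fp + fn),
        "precision": safe_div(tp, tp + fp),
        "recall": safe_div(tp, tp + fn),
        "f1": safe_div(2 * tp, 2 * tp + fp + fn),
    }

metrics = binary_metrics(tn=50, fp=10, fn=5, tp=35)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```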

In [None]:
# Calculate confusion matrix
y_pred = best_model.predict(X_test)
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])  # fixed label order (0 = DOWN, 1 = UP) keeps cm 2x2

# Create heatmap
fig = px.imshow(
    cm,
    labels=dict(x="Predicted", y="Actual", color="Count"),
    x=['DOWN', 'UP'],
    y=['DOWN', 'UP'],
    title=f'Confusion Matrix - {best_model_name}',
    text_auto=True,
    color_continuous_scale='Blues'
)
fig.show()

# Calculate metrics
tn, fp, fn, tp = cm.ravel()
print("\nDetailed Metrics:")
print(f"True Negatives:  {tn}")
print(f"False Positives: {fp}")
print(f"False Negatives: {fn}")
print(f"True Positives:  {tp}")
print()
print(f"Precision: {tp / (tp + fp):.4f}")
print(f"Recall:    {tp / (tp + fn):.4f}")
print(f"F1-Score:  {2 * tp / (2 * tp + fp + fn):.4f}")

## 📝 11. Summary and Conclusions

### Key Takeaways:

1. **Data Collection**: Collected historical price data for the chosen symbol with yfinance
2. **Feature Engineering**: Added 20+ technical indicators as model features
3. **Model Training**: Trained and compared three ML models (Random Forest, Gradient Boosting, XGBoost)
4. **Best Model**: The model with the highest test accuracy is selected automatically (see `best_model_name` in Section 6)
5. **Predictions**: The selected model outputs an UP/DOWN call with class probabilities for the latest data point
6. **Sentiment Analysis**: Classified sample financial headlines by sentiment

### Next Steps:

- 🔄 Retrain models periodically with fresh data
- 📊 Add more features (news sentiment, social media trends)
- 🎯 Fine-tune hyperparameters for better accuracy
- 📈 Implement real-time prediction pipeline
- 🚀 Deploy as a web application (already done with Streamlit!)

### Important Notes:

⚠️ **Disclaimer**: This is for educational purposes only. Past performance does not guarantee future results. Always do your own research before making investment decisions.

---

**Thank you for using this notebook!** 🎉

For the full web application, check out: https://mystockprediction.streamlit.app