# Phase II: Data Curation, Exploratory Analysis and Plotting
## Stock Market Predictor

### Names: Diego Cicotoste, Ariv Ahuja

# Introduction 

How does the stock market work? How can you predict it? What tools can you use? The stock market can seem complex and unpredictable; some would even call it gambling. One of the hardest challenges is making informed decisions. The goal of this project is to tackle that uncertainty and help stock traders make better decisions about whether a stock is tradable and whether to buy or sell. We use past historical trends to make educated predictions about how the stock market will react.

### **1. Data Retrieval**  
- We used the **`yfinance` API** to retrieve daily historical **Open, High, Low, Close (OHLC)** prices and **volume data** for **S&P 500 stocks**, focusing on **Amazon (AMZN)** for the **past year**.
- The retrieved data includes essential market metrics that will serve as the foundation for feature engineering.
- We used **`NewsAPI`** to retrieve news articles about the stock.

---

### **2. Data Cleaning and Processing**

#### **Handling Missing Data**  
- Inspection of the retrieved data found no missing values.

#### **Feature Engineering: Technical Indicators**  
We calculated several key **technical indicators** to enrich the dataset:
  - **RSI (Relative Strength Index)**: Momentum indicator over 14 days.
  - **VWAP (Volume Weighted Average Price)**: Measures the average trading price weighted by volume.
  - **EMA (Exponential Moving Average)**: Captures the smoothed trend over 20 days.
  - **ADX (Average Directional Index)**: Quantifies trend strength.
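To make the indicator definitions concrete, here is a minimal sketch of EMA and RSI in plain pandas. It assumes Wilder's smoothing for RSI and the standard `2 / (length + 1)` smoothing factor for EMA; the helper names `ema` and `rsi` are illustrative, and the values are not guaranteed to match `pandas_ta` exactly, especially during the warm-up period.

```python
import numpy as np
import pandas as pd

def ema(close: pd.Series, length: int = 20) -> pd.Series:
    """Exponential moving average with smoothing factor 2 / (length + 1)."""
    return close.ewm(span=length, adjust=False).mean()

def rsi(close: pd.Series, length: int = 14) -> pd.Series:
    """RSI based on Wilder-smoothed average gains and losses."""
    delta = close.diff()
    gain = delta.clip(lower=0)      # positive price changes, else 0
    loss = (-delta).clip(lower=0)   # magnitude of negative changes, else 0
    avg_gain = gain.ewm(alpha=1 / length, adjust=False, min_periods=length).mean()
    avg_loss = loss.ewm(alpha=1 / length, adjust=False, min_periods=length).mean()
    rs = avg_gain / avg_loss
    return 100 - 100 / (1 + rs)

# Synthetic random-walk prices (illustrative only, not market data)
rng = np.random.default_rng(0)
prices = pd.Series(100 + rng.normal(0, 1, 60).cumsum())

ema_20 = ema(prices)
rsi_14 = rsi(prices)
```

RSI is bounded between 0 and 100, and the first `length` values are NaN because the smoothed averages need a warm-up window.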

#### **More Features: Sentiment Analysis from News Articles**  
- We fetched relevant **news articles** using **NewsAPI** for the same period as the stock data.
- **VADER Sentiment Analysis** was used to calculate **compound sentiment scores** for each article.
- Sentiment scores were **aggregated by date** to align with the stock OHLC data.

#### **Data Alignment and Merging**  
- We ensured **alignment** between **OHLC data, technical indicators, log returns, and sentiment scores** using date-based indices.
- The combined DataFrame was prepared, with all relevant features available for further analysis and modeling.
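The date-based alignment can be sketched as a left join on the trading-day index. The example below uses made-up prices and sentiment values; the variable names are hypothetical and only illustrate how days without news, and news on non-trading days, might be handled.

```python
import numpy as np
import pandas as pd

# Hypothetical trading-day index (Mon 2024-01-01 .. Fri 2024-01-05)
trading_days = pd.bdate_range("2024-01-01", periods=5)
prices = pd.DataFrame(
    {"Close": [100.0, 101.0, 99.5, 102.0, 103.0]}, index=trading_days
)

# Hypothetical daily average sentiment; the last date is a Saturday
sentiment = pd.Series(
    [0.4, -0.2, 0.1],
    index=pd.to_datetime(["2024-01-01", "2024-01-03", "2024-01-06"]),
    name="sentiment_score",
)

# Left join on the date index: trading days without news get NaN,
# and sentiment on non-trading days is dropped
merged = prices.join(sentiment, how="left")
merged["log_returns"] = np.log(merged["Close"] / merged["Close"].shift(1))
```

Keeping NaN for days without articles makes the gap explicit, so a later step can decide whether to fill it (for example with 0 or the previous value) rather than silently shifting scores onto the wrong day.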

---

### **3. Visualization of the Cleaned Data**

We visualized the **cleaned and processed dataset** to understand key trends and patterns:

1. **Price Trends and Indicators**:
   - **OHLC Candlestick Plots**: Show stock price movements.
   - **Overlaying VWAP and EMA**: To track trends and identify support/resistance levels.
   - **RSI and ADX Line Plots**: Visualize momentum and trend strength over time.

2. **Volume Analysis**:
   - **Normalized Volume**: Visualized to detect significant changes in trading activity.

3. **Sentiment Trends**:
   - **Sentiment Score Line Chart**: Displays how public sentiment fluctuates over time.
   - **Overlay of Sentiment with Stock Price**: To observe correlations between sentiment and price movements.
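As a rough sketch of these plots, the function below draws the closing price with an EMA overlay, an RSI panel, and a sentiment panel using `matplotlib`. It is a simplification: it uses a line plot of `Close` instead of a candlestick chart, the function name `plot_overview` is hypothetical, and the demo data is synthetic. The column names `ema_20`, `rsi`, and `sentiment_score` are assumed to exist in the combined DataFrame.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

def plot_overview(df: pd.DataFrame):
    """Plot Close with EMA, RSI, and sentiment on three stacked panels."""
    fig, (ax_price, ax_rsi, ax_sent) = plt.subplots(
        3, 1, sharex=True, figsize=(10, 8)
    )

    # Price panel: closing price with the EMA overlaid
    ax_price.plot(df.index, df["Close"], label="Close")
    ax_price.plot(df.index, df["ema_20"], label="EMA (20)")
    ax_price.set_ylabel("Price")
    ax_price.legend()

    # RSI panel with the conventional 70/30 reference lines
    ax_rsi.plot(df.index, df["rsi"], color="tab:purple")
    ax_rsi.axhline(70, linestyle="--", color="gray")
    ax_rsi.axhline(30, linestyle="--", color="gray")
    ax_rsi.set_ylabel("RSI")

    # Sentiment panel: daily average compound score
    ax_sent.plot(df.index, df["sentiment_score"], color="tab:green")
    ax_sent.axhline(0, color="gray", linewidth=0.8)
    ax_sent.set_ylabel("Sentiment")

    fig.tight_layout()
    return fig

# Synthetic placeholder data (not real AMZN prices or sentiment)
rng = np.random.default_rng(1)
dates = pd.bdate_range("2024-01-01", periods=60)
close = pd.Series(150 + rng.normal(0, 1, 60).cumsum(), index=dates)
demo_df = pd.DataFrame({
    "Close": close,
    "ema_20": close.ewm(span=20, adjust=False).mean(),
    "rsi": rng.uniform(30, 70, 60),
    "sentiment_score": rng.uniform(-1, 1, 60),
})

fig = plot_overview(demo_df)
```

Sharing the x-axis across panels keeps the dates lined up, which makes it easier to compare sentiment swings with price moves by eye.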


In [6]:
import yfinance as yf
import pandas as pd

def get_stock_data(symbol: str, period: str, interval: str = '1d') -> pd.DataFrame:
    """
    Retrieve stock price data for a given symbol, time period, and interval.
    Returns the stock prices as a pandas DataFrame.

    Parameters:
        symbol (str): The ticker symbol of the stock (e.g., 'AAPL').
        period (str): The period to retrieve data (e.g., '1y', '6mo', '5d').
        interval (str): The data interval (e.g., '1d', '1wk', '1mo').

    Returns:
        pd.DataFrame: DataFrame containing historical stock prices.
    """
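    # Fetch data from Yahoo Finance
    stock_data = yf.download(symbol, period=period, interval=interval)

    # Some yfinance versions return (field, ticker) MultiIndex columns even for
    # a single symbol; flatten them so that df['Close'] is a Series.
    if isinstance(stock_data.columns, pd.MultiIndex):
        stock_data.columns = stock_data.columns.get_level_values(0)

    return stock_data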

In [7]:
import numpy as np

def calculate_log_returns(close: np.ndarray, period: int = 1) -> np.ndarray:
    """
    Calculate the log returns of the given 'Close' prices.

    Parameters:
        close (np.ndarray): Array of closing prices.
        period (int): The period over which to calculate log returns (default is 1).

    Returns:
        np.ndarray: Array of log returns.
    """
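    # Convert to a float array so NaN can be stored and a pandas Series is accepted
    close = np.asarray(close, dtype=float)

    # Shift the array using np.roll (circular shift); np.roll returns a new array
    shifted = np.roll(close, period)

    # Set the first 'period' values to NaN since they don't have previous values
    shifted[:period] = np.nan

    # Calculate log returns: ln(P_t / P_{t-period})
    log_returns = np.log(close / shifted)

    return log_returns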
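
The log return over `period` steps is $r_t = \ln(P_t / P_{t-\text{period}})$. A small check with made-up prices illustrates one useful property: log returns add up over time, so the sum of the daily log returns equals the log return over the whole window.

```python
import numpy as np

# Made-up closing prices that rise 10% each day
close = np.array([100.0, 110.0, 121.0])

# One-day log returns: ln(P_t) - ln(P_{t-1}); each equals ln(1.1)
one_day = np.diff(np.log(close))

# Log return over the full window
total = np.log(close[-1] / close[0])

# Additivity: daily log returns sum to the total log return
additive = np.isclose(one_day.sum(), total)
```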


In [8]:
import pandas_ta as ta

def calculate_technical_indicators(df: pd.DataFrame) -> dict:
    """
    Calculate technical indicators and return them as NumPy arrays.

    Parameters:
        df (pd.DataFrame): DataFrame containing historical stock prices.

    Returns:
        dict: A dictionary with technical indicators as NumPy arrays.
    """
    indicators = {}

    # Calculate RSI (Relative Strength Index)
    indicators['rsi'] = ta.rsi(df['Close'], length=14).to_numpy()

    # Calculate 20-day Exponential Moving Average (EMA)
    indicators['ema_20'] = ta.ema(df['Close'], length=20).to_numpy()

    # Calculate ADX (Average Directional Index)
    adx_df = ta.adx(df['High'], df['Low'], df['Close'], length=14)
    indicators['adx'] = adx_df['ADX_14'].to_numpy()

    # Calculate VWAP (Volume Weighted Average Price)
    vwap_series = ta.vwap(df['High'], df['Low'], df['Close'], df['Volume'])
    indicators['vwap'] = vwap_series.to_numpy()

    # Calculate normalized volume
    indicators['normalized_volume'] = (df['Volume'] / df['Volume'].rolling(window=20).mean()).to_numpy()

    return indicators

In [14]:
import requests
import datetime

def get_news_articles(stock_symbol: str, api_key: str, from_date: str, to_date: str) -> list:
    """
    Retrieve news articles related to the given stock symbol along with their publication dates.

    Parameters:
        stock_symbol (str): The stock ticker symbol (e.g., 'AAPL', 'AMZN').
        api_key (str): Your NewsAPI API key.
        from_date (str): Start date for news retrieval in 'YYYY-MM-DD' format.
        to_date (str): End date for news retrieval in 'YYYY-MM-DD' format.

    Returns:
        list: A list of dictionaries containing article titles, descriptions, and publication dates.
    """

    # Build the NewsAPI request; passing params lets requests URL-encode them
    url = "https://newsapi.org/v2/everything"
    params = {
        'q': stock_symbol,
        'from': from_date,
        'to': to_date,
        'sortBy': 'publishedAt',
        'language': 'en',
        'apiKey': api_key,
    }

    # Make the API request and raise on HTTP errors instead of parsing an error body
    response = requests.get(url, params=params, timeout=30)
    response.raise_for_status()
    news_data = response.json()

    # Extract article details (title, description, date)
    articles = [
        {
            'title': article['title'],
            'description': article.get('description') or '',
            'publishedAt': article['publishedAt']
        }
        for article in news_data.get('articles', [])
    ]

    return articles

In [15]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

def add_sentiment_scores_to_articles(articles: list) -> list:
    """
    Add compound sentiment scores to each article dictionary.

    Parameters:
        articles (list): List of dictionaries containing article titles, descriptions, and publication dates.

    Returns:
        list: The original list of dictionaries with added 'compound_score' keys.
    """
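    # Initialize VADER sentiment analyzer
    # (requires the lexicon once: nltk.download('vader_lexicon'))
    analyzer = SentimentIntensityAnalyzer()

    # Add sentiment scores to each article
    for article in articles:
        # Title or description may be None, so fall back to an empty string
        title = article['title'] or ""
        description = article.get('description') or ""
        text = f"{title} {description}".strip()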

        # Get the compound sentiment score and add it to the article dictionary
        article['compound_score'] = analyzer.polarity_scores(text)['compound']

    return articles


In [16]:
def articles_to_sentiment_arr(articles: list) -> np.ndarray:
    """
    Convert a list of articles with sentiment scores into a NumPy array.

    Parameters:
        articles (list): List of dictionaries with 'publishedAt' and 'compound_score' keys.

    Returns:
        np.ndarray: Array of average compound sentiment scores grouped by whole dates.
    """
    # Prepare data for the DataFrame
    sentiment_data = [
        {'date': article['publishedAt'].split('T')[0], 'compound_score': article['compound_score']}
        for article in articles
    ]

    # Convert to DataFrame
    sentiment_df = pd.DataFrame(sentiment_data)

    # Convert 'date' to datetime and average the compound score per calendar date
    sentiment_df['date'] = pd.to_datetime(sentiment_df['date'])
    aggregated_sentiment = sentiment_df.groupby('date')['compound_score'].mean()

    # Return the average sentiment scores as a NumPy array (sorted by date)
    return aggregated_sentiment.to_numpy()

In [None]:
from my_secrets import news_api_key

# Example usage
stock = 'AMZN'
period = '1y'
interval = '1d'

stock_ohlc = get_stock_data(stock, period, interval)
log_returns_arr = calculate_log_returns(stock_ohlc['Close'])
technical_indicators_dict = calculate_technical_indicators(stock_ohlc)

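# get_news_articles expects 'YYYY-MM-DD' strings, not pandas Timestamps
first_date = stock_ohlc.index.min().strftime('%Y-%m-%d')
last_date = stock_ohlc.index.max().strftime('%Y-%m-%d')

article_list = get_news_articles(stock, news_api_key, first_date, last_date)
article_list_sent = add_sentiment_scores_to_articles(article_list)

# Per-date averages without date labels
aggregated_sentiment_arr = articles_to_sentiment_arr(article_list_sent)

# Per-date averages keyed by 'YYYY-MM-DD' so they can be aligned with trading
# days (assumes at least one article was returned)
sentiment_by_date = (
    pd.DataFrame(article_list_sent)
    .assign(date=lambda d: d['publishedAt'].str[:10])
    .groupby('date')['compound_score']
    .mean()
)

stock_df = stock_ohlc.copy()
stock_df['log_returns'] = log_returns_arr
stock_df['rsi'] = technical_indicators_dict['rsi']
stock_df['ema_20'] = technical_indicators_dict['ema_20']
stock_df['adx'] = technical_indicators_dict['adx']
stock_df['vwap'] = technical_indicators_dict['vwap']
stock_df['normalized_volume'] = technical_indicators_dict['normalized_volume']
# Trading days without articles get NaN instead of a misaligned score
stock_df['sentiment_score'] = sentiment_by_date.reindex(
    stock_df.index.strftime('%Y-%m-%d')
).to_numpy()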