# Phase II: Data Curation, Exploratory Analysis and Plotting
## Stock Market Predictor

### Names: Diego Cicotoste, Ariv Ahuja

# Introduction 

How does the stock market work? How can you predict it, and what tools can you use? The stock market can seem complex and unpredictable; some would even call it gambling. One of the hardest challenges is making informed decisions. The goal of this project is to reduce that uncertainty and help stock traders make better decisions about whether a stock is tradable, and whether to buy or sell. We use historical price trends to make educated predictions about how a stock is likely to move.

### **1. Data Retrieval**  
- We used the **`yfinance` API** to retrieve daily historical **Open, High, Low, Close (OHLC)** prices and **volume data** for **S&P 500 stocks**, focusing on **Amazon (AMZN)** for the **past year**.
- The retrieved data includes essential market metrics that will serve as the foundation for feature engineering.

---

### **2. Data Cleaning and Processing**

#### **Handling Missing Data**  
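- Inspection found no missing values in the retrieved OHLC and volume data. The indicator columns computed afterwards start with `NaN` values because their rolling windows need a warm-up period.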

#### **Feature Engineering: Technical Indicators**  
We calculated several key **technical indicators** to enrich the dataset:
  - **RSI (Relative Strength Index)**: Momentum indicator over 14 days.
  - **VWAP (Volume Weighted Average Price)**: Measures the average trading price weighted by volume.
  - **EMA (Exponential Moving Average)**: Captures the smoothed trend over 20 days.
  - **ADX (Average Directional Index)**: Quantifies trend strength.

#### **More Features: Sentiment Analysis from News Articles**  
- We fetched recent **news headlines** for the ticker from the **Yahoo Finance** quote page, scraped with `requests` and `BeautifulSoup`.
- **TextBlob** was used to calculate a **sentiment polarity score** (from -1 to 1) for each article.
- Sentiment scores were **aggregated by date** to align with the stock OHLC data.
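
As a rough illustration of aggregating by date, the sketch below averages article scores per day. It assumes each article dictionary carries a `date` key, which is hypothetical: the scraper in this notebook returns only `title` and `url`. The helper name and the sample scores are made up for the example.

```python
from collections import defaultdict
from statistics import mean


def aggregate_sentiment_by_date(articles):
    """Average 'sentiment_score' per 'date' (both keys are assumed present)."""
    by_date = defaultdict(list)
    for article in articles:
        by_date[article["date"]].append(article["sentiment_score"])
    # Sort by date string so the result follows calendar order
    return {day: mean(scores) for day, scores in sorted(by_date.items())}


# Made-up scores, for illustration only
sample = [
    {"date": "2023-10-19", "sentiment_score": 0.5},
    {"date": "2023-10-19", "sentiment_score": -0.25},
    {"date": "2023-10-20", "sentiment_score": 0.0},
]
daily = aggregate_sentiment_by_date(sample)
print(daily)
```

Averaging is one choice among several; a median or an article-count-weighted score would fit the same structure.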

#### **Data Alignment and Merging**  
- We ensured **alignment** between **OHLC data, technical indicators, log returns, and sentiment scores** using date-based indices.
- The combined DataFrame was prepared, with all relevant features available for further analysis and modeling.
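
Date-based alignment can be pictured as a left join onto the trading-day index. The pure-Python sketch below is an assumption about how such a join might look, not the notebook's actual merge; the helper name and the sentiment values are hypothetical.

```python
def align_to_trading_days(trading_days, daily_sentiment):
    """Left-join daily sentiment onto trading days; None where no score exists."""
    return [(day, daily_sentiment.get(day)) for day in trading_days]


trading_days = ["2023-10-19", "2023-10-20", "2023-10-23"]
# Made-up scores; 2023-10-21 is a Saturday, so it has no matching trading day
daily_sentiment = {"2023-10-20": 0.1, "2023-10-21": -0.3}

aligned = align_to_trading_days(trading_days, daily_sentiment)
print(aligned)
```

Scores dated on non-trading days are dropped by this join; carrying them forward to the next trading day would be an alternative design.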

---

### **3. Visualization of the Cleaned Data**

We visualized the **cleaned and processed dataset** to understand key trends and patterns:

1. **Price Trends and Indicators**:
   - **OHLC Candlestick Plots**: Show stock price movements.
   - **Overlaying VWAP and EMA**: To track trends and identify support/resistance levels.
   - **RSI and ADX Line Plots**: Visualize momentum and trend strength over time.

2. **Volume Analysis**:
   - **Normalized Volume**: Visualized to detect significant changes in trading activity.

3. **Sentiment Trends**:
   - **Sentiment Score Line Chart**: Displays how public sentiment fluctuates over time.
   - **Overlay of Sentiment with Stock Price**: To observe correlations between sentiment and price movements.


In [72]:
import yfinance as yf
import pandas as pd
from typing import Optional

def get_stock_data(symbol: str, period: str, interval: str = '1d') -> Optional[pd.DataFrame]:
    """
    Retrieve stock price data for a given symbol, time period, and interval.
    Returns the stock prices as a pandas DataFrame.

    Parameters:
        symbol (str): The ticker symbol of the stock (e.g., 'AAPL').
        period (str): The period to retrieve data (e.g., '1y', '6mo', '5d').
        interval (str): The data interval (e.g., '1d', '1wk', '1mo').

    Returns:
        Optional[pd.DataFrame]: DataFrame containing historical stock prices,
            or None if no data was found.
    """
    # Fetch data from Yahoo Finance
    stock_data = yf.download(symbol, period=period, interval=interval)

    if stock_data.empty:
        print(f"No data found for {symbol}.")
        return None

    return stock_data

In [73]:
import numpy as np
import pandas as pd

def calculate_log_returns(close: pd.Series) -> pd.Series:
    """
    Calculate the log returns from the close prices.

    Parameters:
        close (pd.Series): Series of closing prices.

    Returns:
        pd.Series: Series of log returns; the first value is NaN.
    """
    # shift(1) is a pandas method, so `close` must be a Series, not a NumPy array
    log_returns = np.log(close / close.shift(1))
    return log_returns
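
As a quick check of the log-return formula, the sketch below recomputes one value by hand from the first two AMZN closing prices printed in the output at the end of this notebook. It uses only the standard library; the variable names are illustrative.

```python
import math

prev_close = 128.399994  # Close on 2023-10-19, from the printed output
close = 125.169998       # Close on 2023-10-20, from the printed output

# Log return = ln(P_t / P_{t-1})
log_return = math.log(close / prev_close)
print(log_return)
```

The result agrees, up to display rounding, with the `log_returns` value shown for 2023-10-20 in that output.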


In [74]:
import pandas_ta as ta

def calculate_technical_indicators(df: pd.DataFrame) -> dict:
    """
    Calculate technical indicators and return them as NumPy arrays.

    Parameters:
        df (pd.DataFrame): DataFrame containing historical stock prices.

    Returns:
        dict: A dictionary with technical indicators as NumPy arrays.
    """
    indicators = {}

    # Calculate RSI (Relative Strength Index)
    indicators['rsi'] = ta.rsi(df['Close'], length=14).to_numpy()

    # Calculate 20-day Exponential Moving Average (EMA)
    indicators['ema_20'] = ta.ema(df['Close'], length=20).to_numpy()

    # Calculate ADX (Average Directional Index)
    adx_df = ta.adx(df['High'], df['Low'], df['Close'], length=14)
    indicators['adx'] = adx_df['ADX_14'].to_numpy()

    # Calculate VWAP (Volume Weighted Average Price)
    vwap_series = ta.vwap(df['High'], df['Low'], df['Close'], df['Volume'])
    indicators['vwap'] = vwap_series.to_numpy()

    # Calculate normalized volume
    indicators['normalized_volume'] = (df['Volume'] / df['Volume'].rolling(window=20).mean()).to_numpy()

    return indicators
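
One thing worth noting about the VWAP column: VWAP is the volume-weighted mean of the typical price `(High + Low + Close) / 3` over an anchor window. If that window contains a single bar, the volume cancels and VWAP equals the typical price. The printed output at the end of this notebook looks consistent with a per-day anchor on daily bars, though that is an inference from the numbers rather than something the notebook states. The helper below is a hypothetical sketch, not the `pandas_ta` implementation.

```python
def vwap(bars):
    """Volume-weighted average of the typical price.

    bars: iterable of (high, low, close, volume) within one anchor window.
    """
    bars = list(bars)
    weighted = sum((h + l + c) / 3 * v for h, l, c, v in bars)
    total_volume = sum(v for _, _, _, v in bars)
    return weighted / total_volume


# Single daily bar for 2023-10-19, values taken from the printed output
one_day = [(132.240005, 127.470001, 128.399994, 60961400)]
value = vwap(one_day)
print(value)
```

With one bar per window the result is just the typical price, which matches the `vwap` value printed for 2023-10-19.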

In [75]:
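import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def get_latest_news_yahoo(stock_symbol: str) -> list:
    """
    Scrape up to five news links from the Yahoo Finance quote page.

    Parameters:
        stock_symbol (str): The ticker symbol of the stock (e.g., 'AMZN').

    Returns:
        list: Dictionaries with 'title' and 'url' keys; empty on a network error.
    """
    url = f"https://finance.yahoo.com/quote/{stock_symbol}"

    try:
        # A timeout keeps the request from hanging indefinitely
        response = requests.get(url, timeout=10)
        response.raise_for_status()
    except requests.exceptions.RequestException as e:
        # Always return a list so callers get a consistent type
        print(f"Network error: {e}")
        return []

    soup = BeautifulSoup(response.content, 'lxml')
    articles = []

    for link in soup.find_all('a', href=True):
        href = link['href']
        title = link.text.strip()
        if '/news/' in href and title:
            # urljoin handles both relative and already-absolute hrefs
            full_url = urljoin("https://finance.yahoo.com", href)
            articles.append({'title': title, 'url': full_url})

    return articles[:5]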


In [76]:
from textblob import TextBlob

def add_sentiment_scores_to_articles(articles: list) -> list:
    """
    Add sentiment scores to the provided list of articles using TextBlob.
    
    Parameters:
        articles (list): A list of dictionaries where each dictionary contains article information 
                         like 'title' and optionally 'description'.
                         
    Returns:
        list: A list of articles, each with an added 'sentiment_score' representing the polarity score.
    """
    articles_with_sentiment = []

    # Loop through each article and calculate the sentiment score based on the title and description
    for article in articles:
        title = article.get('title', "")
        description = article.get('description', "")
        text = f"{title} {description}"
        
        # Use TextBlob to calculate the sentiment polarity score (-1 to 1)
        sentiment = TextBlob(text).sentiment.polarity
        
        # Add the sentiment score to each article dictionary
        article['sentiment_score'] = sentiment
        articles_with_sentiment.append(article)

    return articles_with_sentiment


In [77]:
import numpy as np
import pandas as pd

def articles_to_sentiment_arr(articles: list) -> np.ndarray:
    """
    Convert a list of articles with sentiment scores into a NumPy array.

    Parameters:
        articles (list): List of dictionaries with 'sentiment_score' keys.

    Returns:
        np.ndarray: Array of sentiment scores.
    """
    # Extract the sentiment scores from each article
    sentiment_scores = [article.get('sentiment_score', 0) for article in articles]
    
    # Convert the sentiment scores to a NumPy array
    return np.array(sentiment_scores)


In [78]:
# Example usage
stock = 'AMZN'
period = '1y'
interval = '1d'

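# Fetch stock data (get_stock_data returns None when nothing is found)
stock_ohlc = get_stock_data(stock, period, interval)
if stock_ohlc is None:
    raise ValueError(f"No price data returned for {stock}.")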

# Calculate log returns
log_returns_arr = calculate_log_returns(stock_ohlc['Close'])

# Calculate technical indicators (RSI, EMA, VWAP, etc.)
technical_indicators_dict = calculate_technical_indicators(stock_ohlc)

# Date range covered by the stock data, as strings (not used by the Yahoo scraper below)
first_date = stock_ohlc.index.min().strftime('%Y-%m-%d')
last_date = stock_ohlc.index.max().strftime('%Y-%m-%d')

# Get latest news articles and add sentiment scores
article_list = get_latest_news_yahoo(stock)
if isinstance(article_list, list):
    article_list_sent = add_sentiment_scores_to_articles(article_list)
    aggregated_sentiment_arr = articles_to_sentiment_arr(article_list_sent)
else:
    print(article_list)
    aggregated_sentiment_arr = np.array([])

# Add calculated fields to a copy so the raw OHLC frame is left unchanged
stock_df = stock_ohlc.copy()
stock_df['log_returns'] = log_returns_arr
stock_df['rsi'] = technical_indicators_dict['rsi']
stock_df['ema_20'] = technical_indicators_dict['ema_20']
stock_df['vwap'] = technical_indicators_dict['vwap']
stock_df['normalized_volume'] = technical_indicators_dict['normalized_volume']

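# Handle the length mismatch between stock data and the sentiment array.
# Note: this matches lengths only. The scores come from the latest headlines and
# land on the first rows of the frame, so they are not aligned by date.
if len(aggregated_sentiment_arr) < len(stock_df):
    # Pad the end of the array with NaN up to the length of the DataFrame
    padded_sentiment_arr = np.pad(aggregated_sentiment_arr, (0, len(stock_df) - len(aggregated_sentiment_arr)), 'constant', constant_values=np.nan)
else:
    # Truncate if there are more scores than rows
    padded_sentiment_arr = aggregated_sentiment_arr[:len(stock_df)]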

# Add the sentiment array to the stock DataFrame
stock_df['sentiment_score'] = padded_sentiment_arr

# Display the final dataframe
print(stock_df.head())


[*********************100%***********************]  1 of 1 completed


                  Open        High         Low       Close   Adj Close  \
Date                                                                     
2023-10-19  130.570007  132.240005  127.470001  128.399994  128.399994   
2023-10-20  128.050003  128.169998  124.970001  125.169998  125.169998   
2023-10-23  124.629997  127.879997  123.980003  126.559998  126.559998   
2023-10-24  127.739998  128.800003  126.339996  128.559998  128.559998   
2023-10-25  126.040001  126.339996  120.790001  121.389999  121.389999   

              Volume  log_returns  rsi  ema_20        vwap  normalized_volume  \
Date                                                                            
2023-10-19  60961400          NaN  NaN     NaN  129.370000                NaN   
2023-10-20  56343300    -0.025478  NaN     NaN  126.103333                NaN   
2023-10-23  48260000     0.011044  NaN     NaN  126.139999                NaN   
2023-10-24  46477400     0.015679  NaN     NaN  127.899999                Na