## 1. Extracting stock information from HTML files
<p>In today’s world, news articles are often generated <em>automatically</em>  from financial figures and earnings call transcripts. Both hedge funds and independent traders leverage data science to sift through the vast amounts of information in financial news, aiming to profit from it.</p>
<p>In this notebook, we will explore investment insights by performing sentiment analysis on financial news headlines from <a href="https://finviz.com">FINVIZ.com</a>. Sentiment analysis is a natural language processing technique that helps us understand the emotions conveyed in the headlines. This understanding lets us estimate the market’s sentiment towards specific stocks, whether positive or negative. With this information, we can make informed predictions about how stocks might perform and make trading decisions that could potentially lead to profits. (And hopefully, make money!)</p>
<p><img src="Assets\tsla articles.png" alt="Tesla headlines from FINVIZ.com"></p>

<p>Why focus on headlines, particularly those from FINVIZ?</p>
<ol>
<li>Headlines are generally of similar length, making them easier to analyze and categorize than full articles, which can vary significantly in length.</li>
<li>FINVIZ is known for sourcing its headlines from reputable websites, ensuring a level of consistency in the language used, unlike headlines from independent bloggers. This consistency in language helps in achieving more accurate sentiment analysis results.</li>
</ol>
<p>We will begin by downloading the FINVIZ pages for Tesla, META and several other tickers as HTML files and importing them into our dataset for analysis.</p>

In [5]:
!pip install beautifulsoup4
!pip install nltk
!pip install pandas
!pip install requests
!pip install openai
!pip install nest_asyncio




[notice] A new release of pip is available: 23.3.2 -> 24.0
[notice] To update, run: python.exe -m pip install --upgrade pip







In [1]:
import requests
import os
from datetime import datetime

def download_html(url, filename):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
    }
    # A timeout keeps the notebook from hanging if the server does not respond
    response = requests.get(url, headers=headers, timeout=30)
    response.raise_for_status()  # This will raise an exception for HTTP errors
    with open(filename, 'wb') as f:
        f.write(response.content)

# FINVIZ quote page URL for each ticker
urls = {
    'TSLA': 'https://finviz.com/quote.ashx?t=TSLA&ty=c&ta=1&p=d',
    'META': 'https://finviz.com/quote.ashx?t=META&ty=c&ta=1&p=d',
    'GOOG': 'https://finviz.com/quote.ashx?t=GOOG&ty=c&ta=1&p=d',
    'AAPL': 'https://finviz.com/quote.ashx?t=AAPL&ty=c&ta=1&p=d',
    'KO':   'https://finviz.com/quote.ashx?t=KO&ty=c&ta=1&p=d',
    'AMZN': 'https://finviz.com/quote.ashx?t=AMZN&ty=c&ta=1&p=d',
    'JPM': 'https://finviz.com/quote.ashx?t=JPM&ty=c&ta=1&p=d',
    'MSFT': 'https://finviz.com/quote.ashx?t=MSFT&ty=c&ta=1&p=d',
    'SBUX': 'https://finviz.com/quote.ashx?t=SBUX&ty=c&ta=1&p=d',
    'NVDA': 'https://finviz.com/quote.ashx?t=NVDA&ty=c&ta=1&p=d',
    'AMD': 'https://finviz.com/quote.ashx?t=AMD&ty=c&ta=1&p=d',
}

# Ensure the datasets folder exists
os.makedirs('datasets', exist_ok=True)

# Get the current date and time in a suitable format
now = datetime.now().strftime("%Y-%m-%d-%H-%M-%S")

# Download and save each HTML file with a timestamp in the filename
for ticker, url in urls.items():
    filename = f'datasets/{ticker}-{now}.html'
    download_html(url, filename)
    print(f'Downloaded and saved {ticker} page to {filename}')

Downloaded and saved TSLA page to datasets/TSLA-2024-05-14-10-35-19.html
Downloaded and saved META page to datasets/META-2024-05-14-10-35-19.html
Downloaded and saved GOOG page to datasets/GOOG-2024-05-14-10-35-19.html
Downloaded and saved AAPL page to datasets/AAPL-2024-05-14-10-35-19.html
Downloaded and saved KO page to datasets/KO-2024-05-14-10-35-19.html
Downloaded and saved AMZN page to datasets/AMZN-2024-05-14-10-35-19.html
Downloaded and saved JPM page to datasets/JPM-2024-05-14-10-35-19.html
Downloaded and saved MSFT page to datasets/MSFT-2024-05-14-10-35-19.html
Downloaded and saved SBUX page to datasets/SBUX-2024-05-14-10-35-19.html
Downloaded and saved NVDA page to datasets/NVDA-2024-05-14-10-35-19.html
Downloaded and saved AMD page to datasets/AMD-2024-05-14-10-35-19.html


In [2]:
# Import libraries
from bs4 import BeautifulSoup
import os

html_tables = {}

# For every file in the datasets folder...
for table_name in os.listdir('datasets'):
    # Skip anything that is not a downloaded HTML page
    if not table_name.endswith('.html'):
        continue
    # Build the path to the file
    table_path = os.path.join('datasets', table_name)
    # Open the file in binary read mode (it was saved as bytes); 'with' closes it afterwards
    with open(table_path, 'rb') as table_file:
        # Parse the contents of the file into 'html' with the built-in parser
        html = BeautifulSoup(table_file, 'html.parser')
    # Find the element with id 'news-table' and load it into 'html_table'
    html_table = html.find(id='news-table')
    # Add the table to our dictionary, keyed by file name
    html_tables[table_name] = html_table

## 2. What is inside these files
<p>We've grabbed the table that contains the headlines from each stock's HTML file, but before we start parsing those tables further, we need to understand how the data in that table is structured.</p>
<p>Let's explore the headlines table below.</p>

In [4]:
# Read the headline table from one downloaded file
tsla = html_tables['TSLA-2024-05-14-10-35-19.html']
# Get all the table rows <tr> in the table into 'tsla_tr'
tsla_tr = tsla.find_all('tr')

# For each row...
for i, table_row in enumerate(tsla_tr):
    # Read the text of the element 'a' into 'link_text'
    link_text = table_row.a.get_text() 
    # Read the text of the element <td> into 'data_text'
    data_text = table_row.td.get_text()
    # Print the count
    print(f'File number {i+1}:')
    # Print the contents of 'link_text' and 'data_text' 
    print(link_text)
    print(data_text)
    # The following exits the loop after four rows to prevent spamming the notebook, do not touch
    if i == 3:
        break

File number 1:
McDonalds, Apple and Tesla cant bet on making a fortune in China anymore

            May-13-24 10:04PM
        
File number 2:
Don't use an Amazon adapter to charge your electric car at a Tesla Supercharger

            04:40PM
        
File number 3:
Warren Buffett Says Tesla Achieving Full Self-Driving Would Be "Good For Society And Bad For Insurance Companies Volume"

            04:30PM
        
File number 4:
These Stocks Moved the Most Today: GameStop, AMC, Arm, Squarespace, Walgreens, Tesla, Intel, Apple, and More

            04:25PM
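<p>The output above shows that only the first headline of a day carries a date; the rows that follow show a time only. A minimal sketch of how such cells could be paired with dates, assuming a time-only row belongs to the same day as the closest dated row above it. The helper name <code>assign_dates</code> and the <code>today</code> parameter are hypothetical and not part of the notebook.</p>

```python
from datetime import date, datetime

def assign_dates(cells, today):
    """Pair each timestamp cell with a calendar date.

    A cell is either 'May-13-24 10:04PM' (date and time) or '04:40PM'
    (time only, assumed to mean the same day as the dated row above).
    """
    rows = []
    current = today  # fallback if the very first row carries no date
    for cell in cells:
        parts = cell.split()  # split() also discards the surrounding whitespace
        if len(parts) == 2:
            day, clock = parts
            if day.lower() == "today":
                current = today
            else:
                current = datetime.strptime(day, "%b-%d-%y").date()
        else:
            clock = parts[0]
        rows.append((current, clock))
    return rows

# Timestamp cells copied from the output above
cells = ["May-13-24 10:04PM", "04:40PM", "04:30PM", "04:25PM"]
for day, clock in assign_dates(cells, today=date(2024, 5, 14)):
    print(day, clock)
```

<p>With this rule, all four rows above are dated 2024-05-13 rather than the day the page was downloaded.</p>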
        


## 3. OpenAI GPT-Based Sentiment Analysis for Financial News
<p>Sentiment analysis is very sensitive to context. We chose headlines so that we can extract sentiment from financial journalists who, like most professionals, have their own lingo. To address the unique challenges of financial sentiment analysis, we prompt an OpenAI GPT model to interpret and analyze financial news. This involves:</p>
<ol>
<li>Custom Prompts: We use tailored prompts that instruct the model to consider the financial context, enhancing its ability to parse and understand the subtle nuances in financial news headlines.</li>
<li>Quantitative Sentiment Scoring: The model provides detailed sentiment scores, categorizing sentiments into positive, negative, and neutral. Additionally, it computes a compound score that aggregates these sentiments into a single metric, ranging from -1 (extremely negative) to 1 (extremely positive), providing a quick, overall sentiment outlook.</li>
</ol>
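<p>Because the scores come back as free text from a language model, it may help to sanity-check them against the ranges described above. The sketch below is one possible check, not part of the notebook's pipeline. The function name <code>validate_scores</code> and the <code>tolerance</code> parameter are hypothetical, and the expectation that the three proportions add up to roughly 1 is an assumption.</p>

```python
def validate_scores(scores, tolerance=0.05):
    """Return a list of problems found in one headline's scores (empty if none)."""
    problems = []
    # Positive, negative and neutral scores are described as values from 0 to 1
    for key in ("pos", "neg", "neu"):
        if not 0.0 <= scores[key] <= 1.0:
            problems.append(f"{key}={scores[key]} is outside [0, 1]")
    # The compound score is described as a value from -1 to 1
    if not -1.0 <= scores["compound"] <= 1.0:
        problems.append(f"compound={scores['compound']} is outside [-1, 1]")
    # Assumption: the three proportions should add up to about 1
    total = scores["pos"] + scores["neg"] + scores["neu"]
    if abs(total - 1.0) > tolerance:
        problems.append(f"pos + neg + neu = {total:.2f}, expected about 1")
    return problems

print(validate_scores({"pos": 0.2, "neg": 0.0, "neu": 0.8, "compound": 0.2}))
print(validate_scores({"pos": 0.0, "neg": 0.0, "neu": 0.9, "compound": 0.0}))
```

<p>The first call reports no problems; the second flags that the proportions sum to 0.90. Rows flagged this way could be re-scored or reviewed by hand.</p>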

## 4. Extracting the news headlines
<p>The interesting data inside each table row (<code>&lt;tr&gt;</code>) is the text inside the <code>&lt;td&gt;</code> and <code>&lt;a&gt;</code> tags. The code below processes it in the following steps:</p>
<ol>
<li><strong>parse_news(html_tables)</strong> parses the HTML tables created earlier, for <strong>all</strong> files, into a convenient DataFrame structure.</li>
<li><strong>load_existing_data()</strong> loads the contents of "scored_news.csv" into a DataFrame, and</li>
<li><strong>parse_date(date_str)</strong> standardizes the dates.</li>
<li><strong>main()</strong> concatenates the new data with the news processed before and drops duplicates on the 'headline' and 'date' columns, while keeping the rows already marked <strong>processed</strong>.</li>
<li>The resulting DataFrame, which contains only unique headlines, is passed to <strong>process_headlines(dataframe)</strong>, which sends each <strong>'unprocessed'</strong> headline to the next function.</li>
<li><strong>get_sentiment(text)</strong> places the headline inside a prepared prompt, sends it to the OpenAI GPT model, and receives "response_text".</li>
<li><strong>parse_sentiment(response_text)</strong> takes the response of the model and uses regular expressions to parse the "pos", "neg", "neu" and "compound" scores into the "sentiment_scores" dictionary, which is handed back to <strong>get_sentiment(text)</strong>.</li>
<li>A score that does not match its pattern falls back to 0. If the request itself fails because of a rate limit, get_sentiment waits and sends another request to the OpenAI GPT model, up to five attempts.</li>
<li>Successfully parsed scores are written to "dataframe" and the status is changed to "processed" in <strong>process_headlines(dataframe)</strong>, which was awaiting the results.</li>
<li>The result is assigned back to the "scored_news" DataFrame and written to 'scored_news.csv' in <strong>main()</strong>.</li>
</ol>
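<p>The merge step is the easiest one to get subtly wrong, so here is a minimal sketch of it on toy data. The two headlines and their values are invented for illustration. It assumes previously scored rows come first in the concatenation, so <code>keep="first"</code> retains them, and it rebuilds the index so that later row-by-row updates address exactly one row each.</p>

```python
import pandas as pd

# Toy stand-in for rows already stored in scored_news.csv (invented values)
existing = pd.DataFrame([
    {"headline": "Tesla cuts prices", "date": "2024-05-13",
     "status": "processed", "compound": -0.2},
])

# Toy stand-in for freshly parsed rows; the first one repeats a stored headline
new = pd.DataFrame([
    {"headline": "Tesla cuts prices", "date": "2024-05-13", "status": "unprocessed"},
    {"headline": "Apple unveils new iPad", "date": "2024-05-13", "status": "unprocessed"},
])

merged = (
    pd.concat([existing, new], ignore_index=True)
      # Existing rows come first, so keep="first" preserves the scored copy
      .drop_duplicates(subset=["headline", "date"], keep="first")
      # A clean 0..n-1 index guarantees unique labels for .loc / .at updates
      .reset_index(drop=True)
)

print(merged[["headline", "status"]])
```

<p>Only the second new headline stays 'unprocessed', so only that one would be sent to the model.</p>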

## 5. Implementation for High Throughput News Analysis
<p>Using asyncio for asynchronous processing, our solution sends many headline requests concurrently (up to 10 at a time in the code below) instead of waiting for each response in turn. This matters for trading platforms and financial analysts who need up-to-the-minute sentiment analysis.</p>
<p>After receiving the response text from the OpenAI GPT model, we parse out the sentiment scores for each headline and write them to a CSV file.</p>
<p>For future iterations, incorporating a dedicated financial lexicon will further refine the model’s accuracy. Developing an extensive, finance-specific sentiment dictionary will allow for even more nuanced analysis, essential for predictive models in trading algorithms and market trend analysis.</p>
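<p>To show the concurrency pattern in isolation, here is a small sketch that uses a stand-in coroutine instead of a real API call. <code>fake_score</code>, <code>score_all</code> and the <code>limit</code> parameter are hypothetical names. The semaphore caps how many calls are in flight at once, and <code>asyncio.gather</code> returns results in the same order as the inputs.</p>

```python
import asyncio

async def fake_score(headline, semaphore, stats):
    """Stand-in for the API call; records how many calls overlap."""
    async with semaphore:
        stats["running"] += 1
        stats["peak"] = max(stats["peak"], stats["running"])
        await asyncio.sleep(0.01)  # pretend to wait on the network
        stats["running"] -= 1
        return headline  # a real implementation would return scores

async def score_all(headlines, limit=3):
    # At most `limit` coroutines may hold the semaphore at the same time
    semaphore = asyncio.Semaphore(limit)
    stats = {"running": 0, "peak": 0}
    results = await asyncio.gather(
        *(fake_score(h, semaphore, stats) for h in headlines)
    )
    return results, stats["peak"]

headlines = [f"headline {i}" for i in range(10)]
results, peak = asyncio.run(score_all(headlines))
print(f"{len(results)} results, at most {peak} requests in flight")
```

<p>Inside a notebook, where an event loop is already running, <code>asyncio.run</code> needs <code>nest_asyncio.apply()</code> first, as in the cell below; alternatively, <code>await score_all(headlines)</code> can be called directly.</p>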

In [3]:
import asyncio
from openai import AsyncOpenAI
import pandas as pd
import time
from datetime import datetime, timedelta
import os
import nest_asyncio
import re

nest_asyncio.apply()

client = AsyncOpenAI(api_key=os.getenv("OPENAI_API_KEY"))
semaphore = asyncio.Semaphore(10)  # Adjust based on your system's capability

def parse_sentiment(response_text):
    print(response_text)
    pos_match = re.search(r'Positive[^:]*:\s*([\d\.]+)', response_text, re.I)
    neg_match = re.search(r'Negative[^:]*:\s*([\d\.]+)', response_text, re.I)
    neu_match = re.search(r'Neutral[^:]*:\s*([\d\.]+)', response_text, re.I)
    compound_match = re.search(r'Compound[^:]*:\s*([-\d\.]+)', response_text, re.I)

    # Default to 0 if no match found or if the match does not contain a digit (this handles the '-')
    sentiment_scores = {
        'pos': float(pos_match.group(1)) if pos_match and pos_match.group(1).replace('.', '', 1).isdigit() else 0,
        'neg': float(neg_match.group(1)) if neg_match and neg_match.group(1).replace('.', '', 1).isdigit() else 0,
        'neu': float(neu_match.group(1)) if neu_match and neu_match.group(1).replace('.', '', 1).isdigit() else 0,
        'compound': float(compound_match.group(1)) if compound_match and compound_match.group(1).replace('.', '', 1).replace('-', '', 1).isdigit() else 0,
    }
    print(sentiment_scores)
    return sentiment_scores

async def get_sentiment(text):
    async with semaphore:
        backoff_time = 1  # Start with a 1-second backoff
        max_attempts = 5  # Set a maximum number of retry attempts
        for attempt in range(max_attempts):
            try:
                # Note: adjacent string literals are concatenated, so each piece ends with a space
                prompt = (
                    f"You are a financial journalist who considers the financial context and understands the subtle nuances in financial news headlines. Analyze the sentiment of this headline and compute numeric sentiment scores. Each score should be a numeric value representing the strength "
                    f"and proportion of sentiments in the headline: '{text}'. Calculate these based on the count and intensity of sentimentally charged words categorized as positive, negative, and neutral, dynamically prioritize sentiments based on the contextual strength or weakness of justifications, possibly using additional NLP techniques like parsing for modifiers or negations. The final scores should reflect the proportion and strength of these sentiments as follows: a score of positive sentiment (0 to 1), a score of negative sentiment (0 to 1). Explicitly provide values for positive, negative, and neutral sentiments (e.g., 0.1 positive), Explicitly provide values such as '0.1 positive', and prioritize a higher score for neutral sentiment with lower for positive in cases where the justification for positive sentiment is weak. For example, if a text contains mostly mild or ambiguous positive words without strong affirmative language, the neutral score should be emphasized over the positive score. "
                    f"Include a compound score from -1 to 1, indicating overall sentiment, where -1 is extremely negative, and 1 is extremely positive. Also do not write explanations or other excessive information. "
                    f"If a certain sentiment is not present, return '0' instead of non-numeric placeholders."
                )
                completion = await client.chat.completions.create(
                    messages=[{"role": "user", "content": prompt}],
                    model="gpt-3.5-turbo",
                )
                return parse_sentiment(completion.choices[0].message.content)
            except Exception as e:
                if '429' in str(e):  # Handle rate limit
                    print(f"Rate limit hit, retrying after {backoff_time} seconds...")
                    await asyncio.sleep(backoff_time)
                    backoff_time *= 2  # Double backoff time for each retry
                else:
                    print(f"Error processing sentiment: {e}")
                    break  # Break on other errors after logging
        print("Max retry attempts reached or an error occurred, request failed.")
        return {'pos': 0.0, 'neg': 0.0, 'neu': 0.0, 'compound': 0.0}

async def process_headlines(dataframe):
    tasks = []
    indices = []  # Keep track of indices for updating the dataframe after tasks complete
    for index, row in dataframe.iterrows():
        if row['status'] == 'unprocessed':
            tasks.append(get_sentiment(row['headline']))
            indices.append(index)  # Store the index to update status later

    if tasks:  # Check if there are any tasks to run
        sentiments = await asyncio.gather(*tasks)
        for idx, sentiment in zip(indices, sentiments):
            # Update the row with new sentiment scores
            dataframe.loc[idx, ['pos', 'neg', 'neu', 'compound']] = list(sentiment.values())
            dataframe.at[idx, 'status'] = 'processed'
        return dataframe
    else:
        print("No new headlines to process.")
        return dataframe

def load_existing_data():
    try:
        df = pd.read_csv('scored_news.csv')
        # Apply 'parse_date' to each date in the DataFrame
        df['date'] = df['date'].apply(parse_date)
        return df
    except FileNotFoundError:
        # Each column name appears once; a duplicated 'status' column would break row['status'] lookups
        return pd.DataFrame(columns=['ticker', 'date', 'time', 'headline', 'status', 'pos', 'neg', 'neu', 'compound'])

def parse_date(date_str):
    # Check for special cases like "Today"
    if date_str.lower() == "today":
        return datetime.now().date()  # Return today's date
    
    # This function tries different date formats and returns the parsed date
    date_formats = [
        "%Y-%m-%d",  # e.g., 2024-05-07
        "%d.%m.%Y",  # e.g., 07.05.2024
        "%b-%d-%y"   # e.g., May-05-24
    ]
    for fmt in date_formats:
        try:
            return datetime.strptime(date_str, fmt).date()
        except ValueError:
            continue
    raise ValueError(f"Date format for {date_str} is not supported")

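def parse_news(html_tables):
    existing_data = load_existing_data()
    new_data = []
    for file_name, news_table in html_tables.items():
        if news_table is None:
            continue  # The page had no element with id 'news-table'
        # File names look like 'TSLA-2024-05-14-10-35-19.html', so the ticker precedes the first hyphen
        ticker = file_name.split("-")[0]
        date = None  # Date of the most recent dated row in this table
        for x in news_table.find_all('tr'):
            if x.td is None or x.a is None:
                continue  # Skip rows without a timestamp cell or a headline link
            date_scrape = x.td.text.split()
            if not date_scrape:
                continue  # Skip rows with an empty timestamp cell
            if len(date_scrape) == 1:
                # A time-only row belongs to the same day as the dated row above it
                headline_time = date_scrape[0]
                if date is None:
                    date = datetime.now().date()  # Fall back to today if no dated row was seen yet
            else:
                try:
                    date = parse_date(date_scrape[0])  # Use the parse_date function
                    headline_time = date_scrape[1]
                except ValueError as e:
                    print(e)  # Log the error
                    continue  # Skip this entry on failure
            headline = x.a.text.strip()
            if not ((existing_data['headline'] == headline) & (existing_data['date'] == date)).any():
                new_data.append({'ticker': ticker, 'date': date, 'time': headline_time, 'headline': headline, 'status': 'unprocessed'})
    return pd.DataFrame(new_data)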

async def main():
    # First, parse news and update DataFrame with new entries
    global scored_news
    parsed_news = parse_news(html_tables)
    existing_data = load_existing_data()
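    # Existing rows come first, so keep='first' preserves rows that are already processed.
    # Rebuilding the index gives every row a unique label for the .loc/.at updates in process_headlines.
    scored_news = (
        pd.concat([existing_data, parsed_news], ignore_index=True)
        .drop_duplicates(['headline', 'date'], keep='first')
        .reset_index(drop=True)
    )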
    
    # Process headlines marked as 'unprocessed'
    scored_news = await process_headlines(scored_news)
    scored_news['ticker'] = scored_news['ticker'].str[:7]
    scored_news.to_csv('scored_news.csv', index=False)
    print(scored_news)

asyncio.run(main())

Positive: 0
Negative: 0
Neutral: 0.9
Compound: 0
{'pos': 0.0, 'neg': 0.0, 'neu': 0.9, 'compound': 0.0}
Positive: 0.2

Negative: 0

Neutral: 0.8

Compound: 0.2
{'pos': 0.2, 'neg': 0.0, 'neu': 0.8, 'compound': 0.2}
Positive: 0.3
Negative: 0.2
Neutral: 0.5
Compound: 0.1
{'pos': 0.3, 'neg': 0.2, 'neu': 0.5, 'compound': 0.1}
Positive: 0.6
Negative: 0
Neutral: 0.4
Compound: 0.6
{'pos': 0.6, 'neg': 0.0, 'neu': 0.4, 'compound': 0.6}
Positive: 0.2
Negative: 0
Neutral: 0.8
Compound: 0.2
{'pos': 0.2, 'neg': 0.0, 'neu': 0.8, 'compound': 0.2}
Positive: 0
Negative: 0.5
Neutral: 0.5
Compound: -0.5
{'pos': 0.0, 'neg': 0.5, 'neu': 0.5, 'compound': -0.5}
Positive: 0.3
Negative: 0
Neutral: 0.7
Compound: 0.2
{'pos': 0.3, 'neg': 0.0, 'neu': 0.7, 'compound': 0.2}
Positive: 0.3
Negative: 0
Neutral: 0.7
Compound: 0.3
{'pos': 0.3, 'neg': 0.0, 'neu': 0.7, 'compound': 0.3}
Positive: 0.4
Negative: 0.4
Neutral: 0.2
Compound: 0.0
{'pos': 0.4, 'neg': 0.4, 'neu': 0.2, 'compound': 0.0}
Positive sentiment: 0.2
Negative

## 6. Uploading to Power BI for cleaning, transformation and visualization

<p>The dataset is then uploaded to Power BI, where we clean it up a bit. While some headlines are the same news piece from different sources, the fact that they are written differently could provide different perspectives on the same story. Plus, when one piece of news is more important, it tends to get more headlines from multiple sources. What we want to get rid of is verbatim copied headlines, as these are very likely coming from the same journalist and are just being "forwarded" around.</p>
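<p>The report does this cleanup in Power BI, but the same idea can be sketched in Python. The sketch keeps the first copy of each headline per ticker. Treating headlines that differ only in letter case or spacing as verbatim copies is an assumption, and <code>drop_verbatim_duplicates</code> is a hypothetical helper name.</p>

```python
def drop_verbatim_duplicates(rows):
    """Keep the first row for each (ticker, headline) pair, ignoring case and spacing."""
    seen = set()
    unique = []
    for row in rows:
        # Collapse runs of whitespace and fold case before comparing
        key = (row["ticker"], " ".join(row["headline"].split()).casefold())
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

moved = ("These Stocks Moved the Most Today: GameStop, AMC, Arm, Squarespace, "
         "Walgreens, Tesla, Intel, Apple, and More")
rows = [
    {"ticker": "TSLA", "headline": "Don't use an Amazon adapter to charge your electric car at a Tesla Supercharger"},
    {"ticker": "TSLA", "headline": "Don't use an Amazon adapter to  charge your electric car at a Tesla Supercharger "},
    {"ticker": "TSLA", "headline": moved},
    {"ticker": "AAPL", "headline": moved},
]

cleaned = drop_verbatim_duplicates(rows)
print(len(rows), "rows before,", len(cleaned), "rows after")
```

<p>The repeated Tesla headline is dropped, while the headline that appears under both TSLA and AAPL is kept once per ticker.</p>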
<p>To enhance the report's comprehensiveness, I have incorporated stock price information from Yahoo Finance. Additionally, I have included tables of Insider Trades and Analyst Actions for each stock, sourced from the FINVIZ page. This integration provides a more detailed and informative view of each stock's financial activities.</p>

<p>This is the resulting report: <a href="https://app.powerbi.com/view?r=eyJrIjoiMjgwNDA5ZjYtZDkxMi00OGI3LTkxNzktZjNiMzgxZThlYWY0IiwidCI6IjJhNTQzZDQ1LWE5NzItNDQ3NC05ZDUzLWRjZjFhOTdlMTYyMyIsImMiOjl9&pageName=ReportSection0368f6c5982076648298">app.powerbi.com/view...</a> </p>

<p>Please note that I do not plan to update it frequently, so you may need to select older dates in the relative date filter to see the latest data.</p>