Importing the data files for sentiment analysis

Load the HTML file for each stock into memory:

In [None]:
# Import libraries
from bs4 import BeautifulSoup
import os

html_tables = {}

# For every table in the datasets folder...
for table_name in os.listdir('datasets'):
    # Build the path to the file
    table_path = f'datasets/{table_name}'
    # Open the file in read-only mode
    with open(table_path, 'r') as table_file:
        # Read the contents of the file into 'html'
        html = BeautifulSoup(table_file, 'html.parser')
        # Find 'news-table' in the Soup and load it into 'html_table'
        html_table = html.find(id='news-table')
        # Add the table to our dictionary
        html_tables[table_name] = html_table

To understand how the data in that table is structured, explore the headlines table:

In [None]:
# Read one single day of headlines
tsla = html_tables['tsla_22sep.html']
# Get all the table rows tagged in HTML with <tr> into 'tsla_tr'
tsla_tr = tsla.find_all('tr')
# For each row...
for i, table_row in enumerate(tsla_tr):
    # Read the text of the element 'a' into 'link_text'
    link_text = table_row.a.get_text()
    # Read the text of the element 'td' into 'data_text'
    data_text = table_row.td.get_text()
    # Print the row count
    print(f'Row number {i+1}:')
    # Print the contents of 'link_text' and 'data_text' 
    print(link_text)
    print(data_text)
    # Exit the loop after four rows to avoid flooding the notebook
    if i == 3:
        break
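
The row structure that the loop above relies on can be reproduced on a tiny inline table. This is only a sketch: the markup, timestamps and headlines below are made up and merely assumed to resemble the layout of 'news-table' (one timestamp cell followed by one cell containing a link).

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, assumed to mirror the 'news-table' layout
sample_html = """
<table id="news-table">
  <tr><td>Sep-21-18 09:56PM</td><td><a href="#">First sample headline</a></td></tr>
  <tr><td>05:30PM</td><td><a href="#">Second sample headline</a></td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, 'html.parser')
table = soup.find(id='news-table')

rows = []
for row in table.find_all('tr'):
    # row.td is the first <td> in the row (the timestamp);
    # row.a is the first <a> in the row (the headline link)
    rows.append((row.td.get_text().strip(), row.a.get_text()))

for stamp, headline in rows:
    print(stamp, '|', headline)
```

Note that in this sample only the first row carries a date; the second has a time alone.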

Extracting the news lines:
Extract the ticker, date, time and headline from each stock's BeautifulSoup object.

In [None]:
# Hold the parsed news in a list
parsed_news = []
# Iterate through the news
for file_name, news_table in html_tables.items():
    # Iterate through all tr tags in 'news_table'
    for x in news_table.find_all('tr'):
        # Read the headline text from the row's link tag into 'text'
        text = x.a.get_text()
        # Split the text in the td tag into a list on whitespace
        date_scrape = x.td.text.split()
        # If the length of 'date_scrape' is 1, load 'time' as the only element
        # If not, load 'date' as the 1st element and 'time' as the second
        if len(date_scrape) == 1:
            time = date_scrape[0]
        else:
            date, time = date_scrape

        # Extract the ticker from the file name, get the string up to the 1st '_'  
        ticker = file_name.split('_')[0]
       
        # Append ticker, date, time and headline as a list to the 'parsed_news' list
        parsed_news.append([ticker, date, time, text])
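
The date handling in the loop above depends on 'date' keeping its value from the last row that had one. A minimal sketch of that carry-forward behavior, using made-up timestamp strings in place of the real td cells:

```python
# Hypothetical timestamp cells, in the order they would appear in one table
cells = ['Sep-21-18 09:56PM', '05:30PM', 'Sep-20-18 11:15AM']

parsed = []
date = None  # last date seen; rows with only a time inherit it
for cell in cells:
    parts = cell.split()
    if len(parts) == 1:
        # Only a time is present, so 'date' keeps its previous value
        time = parts[0]
    else:
        date, time = parts
    parsed.append((date, time))

print(parsed)
```

If the very first row had no date, 'date' would still be None here (or undefined in the notebook loop), so the approach assumes each table starts with a dated row.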

Sentiment analysis:
Update the VADER lexicon with new words so that the analyzer scores financial headlines more sensibly.

In [None]:
# NLTK VADER for sentiment analysis
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# New words and values
new_words = {
    'crushes': 10,
    'beats': 5,
    'misses': -5,
    'trouble': -10,
    'falls': -100,
}
# Instantiate the sentiment intensity analyzer with the existing lexicon
# (requires the 'vader_lexicon' resource: nltk.download('vader_lexicon'))
vader = SentimentIntensityAnalyzer()
# Update the lexicon
vader.lexicon.update(new_words)
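
To get a feel for how custom values such as 10 or -100 translate into a compound score, the sketch below applies the normalization x / sqrt(x² + α) with α = 15, which is assumed here to match the final step of the reference VADER implementation. It ignores the rules VADER applies beforehand (negation, intensifiers, punctuation), so treat it as a rough guide for a headline whose only scored word is the custom one.

```python
import math

def normalize(score, alpha=15):
    """Squash a raw valence sum into (-1, 1); assumed to mirror VADER's compound step."""
    return score / math.sqrt(score * score + alpha)

# Raw values taken from the custom word list above
for raw in (5, 10, -100):
    print(raw, round(normalize(raw), 4))
```

Under this assumption, larger magnitudes move the compound score closer to -1 or 1 but with quickly diminishing effect, so very large values mostly serve to drown out the other words in a headline.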

Creating a new DataFrame that contains the ticker, date, time, headline and VADER scores of each headline:

In [None]:
import pandas as pd
# Use these column names
columns = ['ticker', 'date', 'time', 'headline']
# Convert the list of lists into a DataFrame
scored_news = pd.DataFrame(parsed_news, columns=columns)
# Iterate through the headlines and get the polarity scores.
# The polarity_scores method of SentimentIntensityAnalyzer returns a
# dictionary with the neg, neu, pos, and compound scores.
scores = scored_news['headline'].apply(vader.polarity_scores).to_list()
print(scores)
# Convert the list of dicts into a DataFrame
scores_df = pd.DataFrame(scores)
# Join the DataFrames
scored_news = scored_news.join(scores_df)
print(scored_news)
# Convert the date column from string to datetime
scored_news['date'] = pd.to_datetime(scored_news.date).dt.date
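
The list-of-dicts to DataFrame step and the join can be tried in isolation. In this sketch the headlines and score dictionaries are made up by hand rather than produced by VADER; only the shape (one dict per headline with neg, neu, pos and compound keys) is taken from the cell above.

```python
import pandas as pd

# Hypothetical headlines and score dictionaries, made up for illustration
news = pd.DataFrame({'headline': ['sample headline one', 'sample headline two']})
scores = [
    {'neg': 0.0, 'neu': 0.6, 'pos': 0.4, 'compound': 0.5},
    {'neg': 0.5, 'neu': 0.5, 'pos': 0.0, 'compound': -0.4},
]

# One row per dict, one column per key; the default index matches that of 'news'
scores_df = pd.DataFrame(scores)
# join aligns on the index, so row i of 'news' receives the scores of dict i
joined = news.join(scores_df)
print(joined)
```

Because the join is index-based, it relies on both frames sharing the same default 0..n-1 index, which holds as long as neither has been filtered or re-sorted in between.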

Plotting the time series for the stocks 

In [None]:
import matplotlib.pyplot as plt
plt.style.use("fivethirtyeight")
%matplotlib inline

# Group by ticker and date, then take the mean of the numeric score columns
# (numeric_only=True skips the 'time' and 'headline' text columns)
mean_c = scored_news.groupby(['ticker', 'date']).mean(numeric_only=True)
mean_c.head(3)

Unstacking the 'ticker' level prepares the data for plotting. After the groupby, mean_c is in long format: each row holds the mean sentiment scores for one ticker on one date, indexed by ticker and date.

Unstacking 'ticker' converts the table from long to wide format: the dates stay as rows, and each ticker gets its own column under every score (neg, neu, pos and compound).

This layout makes it easy to follow the sentiment trend of each stock over time and to compare stocks, since their scores for the same date sit side by side in adjacent columns.

In [None]:
# Unstack the column ticker
mean_c = mean_c.unstack('ticker')
mean_c.head(3)

In [None]:
# Get the cross-section of compound in the 'columns' axis
mean_c = mean_c.xs('compound', axis=1)
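
The groupby, unstack and cross-section steps above can be followed on a toy frame. The tickers, dates and scores here are invented purely for illustration and only two score columns are included.

```python
import pandas as pd

# Hypothetical scores; tickers, dates and values are made up for illustration
toy = pd.DataFrame({
    'ticker': ['aaa', 'aaa', 'bbb', 'bbb'],
    'date': ['2018-09-21', '2018-09-21', '2018-09-21', '2018-09-22'],
    'compound': [0.2, 0.4, -0.5, 0.1],
    'neg': [0.0, 0.1, 0.3, 0.0],
})

# Long format: one row per (ticker, date) pair
long_means = toy.groupby(['ticker', 'date']).mean()
# Wide format: dates as rows, a (score, ticker) column pair for each combination
wide = long_means.unstack('ticker')
# Keep only the 'compound' columns, leaving one column per ticker
compound_wide = wide.xs('compound', axis=1)
print(compound_wide)
```

Ticker 'aaa' has no headlines on the second date, so its cell there is NaN; the same happens in the real data whenever a stock has no news on a given day.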

In [None]:
# Plot a bar chart with pandas
mean_c.plot.bar(title='Compound VADER polarity scores')