## 1920-DSS2100 Skills for Data Science 2: Project 

The following is a write-up of a project I undertook for my 'Skills for Data Science 2: Project' module in second-year Arts with Data Science at NUIG.
This semester, I improved my Python programming skills, particularly with the Pandas module and various NLP modules. I learned primarily through the online learning website DataCamp.com.
I chose to complete and write up the project "Extract Stock Sentiment from News Headlines" on DataCamp for this module, as I wanted to develop my NLP skills in particular.

In this project I use both natural language processing methods and dataframe manipulation in Pandas to analyse stock sentiment.

In [None]:

# Import BeautifulSoup, which we shall use to obtain our data.
from bs4 import BeautifulSoup
import os


# Initialize our dictionary, which we will use to create our dataset.
html_tables = {}

# Loop over the files in the 'datasets' folder and fill our dictionary with each table, keyed by its file name.
for table_name in os.listdir('datasets'):
    table_path = f'datasets/{table_name}'
    # Open the file with a context manager so that it is closed once it has been parsed.
    with open(table_path, 'r') as table_file:
        # Parse the file with BeautifulSoup into a variable 'html'.
        # The parser is named explicitly to avoid the "no parser specified" warning.
        html = BeautifulSoup(table_file, 'html.parser')
    # Find the element with the id 'news-table', as this holds the data that we are interested in.
    html_table = html.find(id='news-table')
    # Add this table to our dictionary under the key 'table_name'.
    html_tables[table_name] = html_table
    

The above code uses BeautifulSoup to acquire our data. It is important to get this step right, as everything in the rest of the code builds on this data. BeautifulSoup gives us tools to quickly sift through the documents and to locate and store the data we want to use.
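As a minimal sketch of what `find(id=...)` and the row lookups do, the example below parses a small HTML string. The fragment is invented for illustration and is not real FINVIZ data; only the `news-table` id is taken from the code above.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment, made up for this example (not FINVIZ data).
sample_html = """
<html><body>
  <table id="other-table"><tr><td>ignored</td></tr></table>
  <table id="news-table">
    <tr><td>Jan-01-19 09:30AM</td><td><a href="#">Example headline one</a></td></tr>
    <tr><td>10:15AM</td><td><a href="#">Example headline two</a></td></tr>
  </table>
</body></html>
"""

# Parse the string with Python's built-in HTML parser.
soup = BeautifulSoup(sample_html, "html.parser")
# find() returns the first element that matches, here the table with id 'news-table'.
news_table = soup.find(id="news-table")
# find_all() returns every matching descendant, here each table row.
rows = news_table.find_all("tr")
# row.a and row.td are the first <a> and first <td> inside each row.
headlines = [row.a.get_text() for row in rows]
timestamps = [row.td.get_text() for row in rows]
print(headlines)
print(timestamps)
```

Only the rows inside `news-table` are returned; the row in the other table is never visited.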

In this particular instance, we used the code to grab news headlines from HTML files for stocks from FINVIZ.com. For the next step, a small bit of exploratory data analysis is required in order to develop an understanding of what data we are handling.

In [None]:

# Read a single day of headlines, to keep it simple.
tsla = html_tables['tsla_22sep.html']
# Get all the table rows tagged in HTML with <tr> into 'tsla_tr' using BeautifulSoup.
tsla_tr = tsla.find_all('tr')

# For each row in our BeautifulSoup object:
for i, table_row in enumerate(tsla_tr):
    
    # Read the text of the element 'a' into a variable.
    link_text = table_row.a.get_text()
    # Read the text of the element 'td' into a variable.
    data_text = table_row.td.get_text()
    # Print the row count. Each row holds one headline.
    print(f'Headline number {i+1}:')
    # Print the contents of 'link_text' and 'data_text'.
    print(link_text)
    print(data_text)
    # Prevent spam, plus this is enough data to get an idea of what we are dealing with.
    if i == 3:
        break


We can see from our output that our data has a few main components, such as a headline and a timestamp. The next goal is to parse this data and extract the news headlines, as this is the data we want to use for our stock sentiment analysis. We do so using the following code:

In [None]:

# Initialize an empty list to hold our parsed news.
parsed_news = []

# Iterate through the news in the data we obtained.
for file_name, news_table in html_tables.items():
    
    # Iterate through all the elements with the tr tag in 'news_table'.
    for x in news_table.find_all('tr'):
        # Split the text in the td tag into a list.
        date_scrape = x.td.text.split()
        # If the length of 'date_scrape' is 1, load 'time' as the only element.
        # If not, load 'date' as the 1st element and 'time' as the second.
        # Numerous articles can be released in one day, and only some rows carry a date.
        # Rows without one keep the 'date' set by the most recent row that had it.
        if len(date_scrape) == 1:
            time = date_scrape[0]
        else:
            date = date_scrape[0]
            time = date_scrape[1]
            
        # Extract the ticker from the file name, get the string up to the 1st '_'.  
        ticker = file_name.split('_')[0]
        # Append ticker, date, time and headline as a list to the 'parsed_news' list.
        parsed_news.append([ticker, date, time, x.a.get_text()])


We have now generated a list of lists, which we will use to create the dataframe that forms the basis of our stock sentiment analysis. With the data in place, it's time to get started on the tools that we'll be using.
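The date handling in the parsing loop can be sketched in isolation. The function name `parse_timestamps` and the sample strings are made up for illustration; the logic mirrors the `len(date_scrape) == 1` check above, where a row without a date reuses the date from the most recent row that had one.

```python
def parse_timestamps(cells):
    """Return (date, time) pairs, carrying the last seen date forward."""
    parsed = []
    current_date = None
    for cell in cells:
        parts = cell.split()
        if len(parts) == 1:
            # Only a time is present, so the previous date still applies.
            time = parts[0]
        else:
            # Both a date and a time are present, so update the current date.
            current_date = parts[0]
            time = parts[1]
        parsed.append((current_date, time))
    return parsed


# Hypothetical timestamp cells, in the order they might appear in a table.
sample_cells = ["Jan-01-19 09:30AM", "08:15AM", "Dec-31-18 05:00PM", "01:45PM"]
result = parse_timestamps(sample_cells)
for date, time in result:
    print(date, time)
```

One consequence of this approach is that the rows must be processed in order: if the first row of a table had no date, its date would be `None` here (and undefined in the loop above).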

Our next task is to import SentimentIntensityAnalyzer from nltk.sentiment.vader and to update its lexicon so that it reads headlines more like a financial journalist would. This last part is crucial to the sentiment analysis: we don't want our algorithm to misinterpret headlines because it misunderstands the words used. We assign values to certain words, which the algorithm can then use when assigning polarity scores to the news headlines, using the following code:

In [None]:

# The algorithm we are using to conduct the sentiment analysis.
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# The new words and values that we are going to teach the algorithm.
new_words = {
    'crushes': 10,
    'beats': 5,
    'misses': -5,
    'trouble': -10,
    'falls': -100,
}

# Instantiate the sentiment intensity analyzer with the already existing lexicon.
vader = SentimentIntensityAnalyzer()
# Update the lexicon with our new words and values that we want our algorithm to use.
vader.lexicon.update(new_words)


We have now imported the algorithm we plan to use for our sentiment analysis and updated it with some vocabulary and associated values that we want it to use. We can now prepare our data, using Pandas, and then use our algorithm to generate the polarity scores.
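To get a feel for how the values in `new_words` affect the 'compound' score, here is a simplified sketch. It assumes, from my reading of the VADER source, that the summed word valences are squashed into the range -1 to 1 with x / sqrt(x² + 15); the function below is my own re-implementation of that one step, not a call into NLTK, and it ignores VADER's handling of negation, punctuation and capitalisation.

```python
import math


def normalize(raw_score, alpha=15):
    """Squash a raw valence sum into (-1, 1). alpha=15 is assumed from the VADER source."""
    return raw_score / math.sqrt(raw_score * raw_score + alpha)


# Three of the custom words and values from the lexicon update above.
custom_values = {"beats": 5, "crushes": 10, "falls": -100}

squashed = {word: normalize(value) for word, value in custom_values.items()}
for word, score in squashed.items():
    print(f"{word}: raw {value if (value := custom_values[word]) else 0} -> {score:.3f}")
```

Under this assumption, a single word with a large value already pushes the score close to its limit, so a value of -100 behaves much like a value of -10. As far as I know, VADER's built-in lexicon rates words on a scale of roughly -4 to +4, so all of the custom values here are strong by comparison.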

In [None]:

# Import pandas.
import pandas as pd

columns = ['ticker', 'date', 'time', 'headline']
# Convert the list of lists we created by parsing the HTML files into a DataFrame.
scored_news = pd.DataFrame(parsed_news, columns=columns)
# We then run our algorithm over each headline, generating the polarity scores and storing them in 'scores'.
scores = [vader.polarity_scores(headline) for headline in scored_news['headline']]
# Convert the list of dictionaries we generated into another DataFrame.
scores_df = pd.DataFrame(scores)
# Join our first DataFrame to this DataFrame, generating a new DataFrame with all of our data, including polarity scores.
scored_news = scored_news.join(scores_df)
# Convert the date column from string to datetime.
scored_news['date'] = pd.to_datetime(scored_news.date).dt.date


We now have our scored data. Before cleaning it, we can plot the average sentiment of each ticker against time, which takes a bit of dataframe manipulation using pandas.
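The group, unstack and cross-section steps are easier to follow on a toy frame. The values below are invented for illustration; only the column names match our dataset.

```python
import pandas as pd

# Hypothetical scores, made up for this example.
toy = pd.DataFrame({
    "date": ["2019-01-02", "2019-01-02", "2019-01-02", "2019-01-03"],
    "ticker": ["fb", "fb", "tsla", "fb"],
    "compound": [0.5, -0.1, 0.2, 0.4],
    "neg": [0.0, 0.3, 0.0, 0.1],
})

# One row per (date, ticker) pair, holding the mean of each numeric column.
toy_mean = toy.groupby(["date", "ticker"]).mean(numeric_only=True)
# Move 'ticker' from the row index into the columns, giving (score, ticker) column pairs.
toy_wide = toy_mean.unstack("ticker")
# Keep only the 'compound' columns, leaving one column per ticker.
toy_compound = toy_wide.xs("compound", axis="columns")
print(toy_compound)
```

The result has one row per date and one column per ticker, which is the shape `plot.bar()` needs to draw a group of bars for each day. A ticker with no headlines on a given day shows up as `NaN`.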

In [None]:
# Import and prepare modules.
import matplotlib.pyplot as plt
plt.style.use("fivethirtyeight")
%matplotlib inline

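# Group by the date and ticker columns from scored_news and calculate the mean of the numeric columns.
# numeric_only=True skips the text columns ('time' and 'headline'), which recent pandas versions refuse to average.
mean_c = scored_news.groupby(['date', 'ticker']).mean(numeric_only=True)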
# Unstack the column ticker.
mean_c = mean_c.unstack('ticker')
# Get the cross-section of 'compound' in the 'columns' axis.
mean_c = mean_c.xs('compound', axis='columns')
# Plot a bar chart with pandas.
mean_c.plot.bar()


A quick look at the bar chart reveals that something is a little off: there is a sharp spike on November 22nd. Looking back through our data, we can see that there were only 5 headlines from that day, two of which are exactly the same. We need to clean up these small discrepancies to make the results of our analysis more accurate.
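To see why a duplicate matters so much on a day with few headlines, here is a toy sketch. The headlines and scores are invented for illustration.

```python
import pandas as pd

# Hypothetical headlines for one ticker on one day; the first two are identical.
toy = pd.DataFrame({
    "ticker": ["fb", "fb", "fb"],
    "headline": ["Shares surge", "Shares surge", "Outlook unchanged"],
    "compound": [0.8, 0.8, 0.0],
})

# The duplicate is counted twice, pulling the daily mean towards its score.
mean_before = toy["compound"].mean()
# Keep only the first occurrence of each (headline, ticker) pair.
deduped = toy.drop_duplicates(subset=["headline", "ticker"])
mean_after = deduped["compound"].mean()

print(f"Mean with duplicate: {mean_before:.3f}")
print(f"Mean without duplicate: {mean_after:.3f}")
```

With only three rows, a single repeated headline shifts the mean noticeably, which is the kind of effect that can produce a spike in the bar chart.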

In [None]:

# Count the number of headlines in scored_news (store as integer).
num_news_before = len(scored_news)
# Drop duplicates based on ticker and headline. This cleans the data.
scored_news_clean = scored_news.drop_duplicates(subset=['headline', 'ticker'])
# Count number of headlines after dropping duplicates.
num_news_after = len(scored_news_clean)
# Print before and after numbers to compare the old data to the clean data.
print(f"Before we had {num_news_before} headlines, now we have {num_news_after}")


Our dataset is now clean and ready to use, and much can be learned from it. The following code is an example of the type of analysis that can be done with it:

In [None]:

# Set the index to ticker and date.
single_day = scored_news_clean.set_index(['ticker', 'date'])
# Cross-section the fb row.
single_day = single_day.xs('fb', level='ticker')
# Select the 3rd of January of 2019.
single_day = single_day.loc['2019-01-03']
# Convert the datetime string to just the time.
single_day['time'] = pd.to_datetime(single_day['time']).dt.time
# Set the index to time.
single_day = single_day.set_index('time')
# Sort it.
single_day = single_day.sort_index()


Here, we've extracted a small slice of the data: the headlines for one ticker on a single day. We can then plot the data and visualise the negative, neutral and positive scores from one day of trading, according to our dataset.
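The reason for converting the 'time' column before sorting can be shown with a short sketch. The time strings and the format `%I:%M%p` are assumptions for illustration, based on 12-hour times such as "09:30AM".

```python
import pandas as pd

# Hypothetical 12-hour time strings, made up for this example.
times = pd.Series(["09:30AM", "04:15PM", "11:05AM"])

# Sorting the raw strings compares them character by character.
sorted_as_text = sorted(times)
# Parsing them first gives datetime.time objects that sort chronologically.
parsed = pd.to_datetime(times, format="%I:%M%p").dt.time
sorted_as_time = sorted(parsed)

print(sorted_as_text)
print([str(t) for t in sorted_as_time])
```

Sorted as text, the afternoon headline comes first because "04" sorts before "09"; sorted as times, it correctly comes last.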

In [None]:

TITLE = "Negative, neutral, and positive sentiment for FB on 2019-01-03"
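COLORS = ["red", "orange", "green"]
# Drop the columns that aren't useful for the plot.
plot_day = single_day.drop(columns=['headline', 'compound'])
print(plot_day)
# Rename the score columns and order them as 'negative', 'neutral', 'positive' to match the title and colors.
plot_day = plot_day.rename(columns={'neg': 'negative', 'neu': 'neutral', 'pos': 'positive'})
plot_day = plot_day[['negative', 'neutral', 'positive']]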
# Plot a stacked bar chart.
plot_day.plot.bar(stacked = True, 
                  figsize=(10, 6), 
                  title = TITLE, 
                  color = COLORS).legend(bbox_to_anchor=(1.2, 0.5))


In summary, we have created a dataset from news headlines extracted from HTML files and scored with an NLTK algorithm. After a bit of cleaning, it can be used to analyse the prevailing sentiment of headlines about Facebook and Tesla stocks. The code uses BeautifulSoup, Pandas and NLTK to extract, store and analyse the data.