### This model predicts the sentiment of news articles on crude oil
### A Gaussian Naive Bayes classifier is used
### The process is divided into 3 phases: 
### 1. Obtaining data from OilPrice.com and labelling each article as positive, negative or neutral with the vaderSentiment NLP library
### 2. Training the model on 90% of this labelled data
### 3. Testing the model on the remaining 10% of the data

In [412]:
#Install vaderSentiment, a popular rule-based sentiment analysis library developed at Georgia Tech
!pip install vaderSentiment



In [413]:
# import necessary libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import GaussianNB
import pickle
import numpy as np

print("Required libraries successfully installed")

Required libraries successfully installed


In [414]:
# Create lists to store scraped news urls, headlines and text
url_list = []
news_text = []
headlines = [] 

In [415]:
# Get the URLs of all articles which will be used to create the training dataset

for i in range(1,8): #parameters of range function correspond to page numbers in the website with news listings
    #get the list of unique urls in the page
    url = 'https://oilprice.com/Energy/Crude-Oil/Page-{}.html'.format(i)
    request = requests.get(url)
    soup = BeautifulSoup(request.text, "html.parser")
    for links in soup.find_all('div', {'class': 'categoryArticle'}):
        for info in links.find_all('a'):
            if info.get('href') not in url_list: # Avoid repeats
                url_list.append(info.get('href'))
print("A total of",len(url_list),"articles obtained")

A total of 140 articles obtained


In [416]:
# Get the data from within each article and store into lists 
# The headline of the article along with the text within the article will be scraped

for www in url_list:
    #access each url
    headlines.append(www.split("/")[-1].replace('-',' '))
    request = requests.get(www)
    soup = BeautifulSoup(request.text, "html.parser")
    
    #store the text of the news
    temp_news = []
    for news in soup.find_all('p'):
        temp_news.append(news.text)
            
    # We don't need the author byline ("By ... Oilprice.com") given at the end of each article
    # Walk backwards through the paragraphs to find the last one that starts with "By"
    for last_sentence in reversed(temp_news):
        if last_sentence.split(" ")[0] == "By":
            break

    # Keep only the paragraphs between the "More Info" marker and the byline
    temp_news = temp_news[temp_news.index("More Info")+1:temp_news.index(last_sentence)]
    news_article_joined = ' '.join(temp_news)
    news_text.append(news_article_joined)
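
### The trimming step above can be sketched as a standalone function. This is a minimal sketch, not the notebook's code: trim_article is a hypothetical helper, the sample paragraphs are made up, and unlike the loop above it falls back to the full paragraph list when the "More Info" marker or the byline is missing instead of raising an error

```python
def trim_article(paragraphs, start_marker="More Info"):
    """Return the article body between the start marker and the author byline.

    Hypothetical helper: when a marker is missing it keeps everything on
    that side rather than raising ValueError.
    """
    # Start right after the marker paragraph, or at the top if it is absent
    start = paragraphs.index(start_marker) + 1 if start_marker in paragraphs else 0

    # The byline is the last paragraph whose first word is "By"
    end = len(paragraphs)
    for i in range(len(paragraphs) - 1, start - 1, -1):
        if paragraphs[i].split(" ")[0] == "By":
            end = i
            break

    return " ".join(paragraphs[start:end])


# Made-up paragraphs that mimic the page layout described above
sample = ["Share", "More Info", "Oil prices rose on Monday.",
          "Traders cited tighter supply.", "By Jane Doe for Oilprice.com"]
print(trim_article(sample))
```

### Articles that yield an empty string could then be skipped before the dataframe is built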

In [417]:
# save news text along with the news headline in a dataframe      
news_df = pd.DataFrame({ 'Headline': headlines,
                         'News': news_text,
                       })

In [418]:
# use VADER to perform sentiment analysis on stored news articles
news_sentiment = SentimentIntensityAnalyzer()

def comp_score(text):
    return news_sentiment.polarity_scores(text)["compound"]
  
news_df["sentiment"] = news_df["News"].apply(comp_score)

In [419]:
# Classifying the sentiment score into positive, negative, or neutral
news_df.loc[news_df['sentiment'] >=0.1, 'Sentiment_Classification'] = 'positive'
news_df.loc[news_df['sentiment']<=-0.1,'Sentiment_Classification']= 'negative'
news_df['Sentiment_Classification']=news_df['Sentiment_Classification'].fillna('neutral')
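
### The thresholding rule above can also be written as a plain function. This is a sketch only: classify_compound is a hypothetical name, and the 0.1 threshold is the one used in the cell above

```python
def classify_compound(score, threshold=0.1):
    """Map a VADER compound score in [-1, 1] to a sentiment label."""
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    # Scores strictly between -threshold and +threshold are treated as neutral
    return "neutral"


for s in (0.85, 0.1, 0.05, -0.1, -0.6):
    print(s, classify_compound(s))
```

### Under this assumption, news_df["sentiment"].apply(classify_compound) would produce the same labels as the three .loc / fillna statements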

In [420]:
#This is the dataset that will be used for the prediction model
news_df.head()
news_df.to_csv("CrudeOil_News_Articles.csv",index=False) 

### Next step is the vectorization of data
### The news article text will be used for this
### English stop words will be dropped
### The pickle library will be used to store the fitted vectorizer for reuse in future
### The Term Frequency-Inverse Document Frequency (TF-IDF) method will be used to weight the features extracted from the articles


In [702]:
# Read the data from the CSV file (to_csv wrote it as UTF-8, the pandas default)
data=pd.read_csv("CrudeOil_News_Articles.csv", encoding="utf-8")

In [703]:
#Some of the articles did not return any data; we need to drop these rows before we vectorize the data
data.dropna(inplace=True)
data.reset_index(drop=True,inplace=True)

#We need the news article text data, which is located in column 1
news_data=data.iloc[:,1]

In [704]:
#Create a vectorize object 
vectorize_object = CountVectorizer(stop_words='english')

In [705]:
news_data_vec = vectorize_object.fit_transform(news_data)
# Save the fitted vectorizer for reuse; the with-block closes the file afterwards
with open("crude_oil_data_vectorize", 'wb') as f:
    pickle.dump(vectorize_object, f)
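
### To make the count vectorization step concrete, here is a pure-Python sketch of a bag-of-words matrix. It mirrors CountVectorizer's defaults (lowercasing, tokens of two or more word characters, alphabetically ordered vocabulary), but STOP_WORDS is a tiny illustrative list rather than scikit-learn's English list, and the two documents are made up

```python
import re
from collections import Counter

# Tiny illustrative stop list (an assumption, not scikit-learn's English list)
STOP_WORDS = {"the", "on", "as", "of", "and"}


def tokenize(text):
    # Lowercase, keep tokens of two or more word characters, drop stop words
    return [t for t in re.findall(r"\b\w\w+\b", text.lower()) if t not in STOP_WORDS]


def count_matrix(docs):
    token_lists = [tokenize(d) for d in docs]
    # One column per distinct term, in alphabetical order
    vocab = sorted({t for tokens in token_lists for t in tokens})
    rows = []
    for tokens in token_lists:
        counts = Counter(tokens)
        rows.append([counts[term] for term in vocab])
    return vocab, rows


docs = ["Oil prices rise as supply tightens", "Oil prices fall on weak demand"]
vocab, rows = count_matrix(docs)
print(vocab)
for row in rows:
    print(row)
```

### Each row is one article and each column one vocabulary term, which is the shape that the fitted vectorizer returns as a sparse matrix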

In [706]:
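## convert sparse matrix into a dense array
## (toarray returns a numpy ndarray; the np.matrix returned by todense is rejected by recent scikit-learn versions)
news_data_vec = news_data_vec.toarray()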

In [707]:
# Transform data by applying term frequency-inverse document frequency (TF-IDF) 
# This is important because we do not want the model to give much weight to words 
# (e.g. oil) which do not affect the sentiment but occur frequently because they are the subject 
# By using TF-IDF, we reduce the weighting of these words
# TfidfTransformer then L2-normalizes each row, so the values lie between 0 and 1
tfi = TfidfTransformer() 
news_data_tfi = tfi.fit_transform(news_data_vec)
news_data_tfi = news_data_tfi.toarray()
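
### The weighting described above can be worked through by hand. The sketch below assumes TfidfTransformer's default settings (smoothed idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each row); the tfidf function and the count matrix are hypothetical, for illustration only

```python
import math


def tfidf(rows):
    n_docs = len(rows)
    n_terms = len(rows[0])
    # Document frequency: in how many documents each term occurs
    df = [sum(1 for row in rows if row[j] > 0) for j in range(n_terms)]
    # Smoothed inverse document frequency
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    out = []
    for row in rows:
        weighted = [count * w for count, w in zip(row, idf)]
        # L2-normalize the row so that its squared values sum to 1
        norm = math.sqrt(sum(v * v for v in weighted))
        out.append([v / norm if norm else 0.0 for v in weighted])
    return out


# Made-up counts; columns stand for "oil", "rally", "glut"
counts = [[3, 1, 0],
          [2, 0, 1]]
weights = tfidf(counts)
for row in weights:
    print([round(v, 3) for v in row])
```

### "oil" occurs in both documents, so its idf is the minimum value of 1 and its weight relative to "rally" drops below the raw 3:1 count ratio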

In [708]:
# Merge the tfi vector with the data vector
news_data_tfi_df=pd.DataFrame(news_data_tfi)
news_data_tfi_df=news_data_tfi_df.join(data)

### Now that the data vector is ready, we can split it into train and test 
### For this case, a 9:1 split is taken, i.e. 90% of the data is used for training 

In [709]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(news_data_tfi_df, test_size=0.1)

### Now let's train the model

In [710]:
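# The first n_features columns hold the TF-IDF values; the last column is the label
n_features = news_data_tfi.shape[1]
x_train=train.iloc[:,0:n_features]
y_train=train.iloc[:,-1]

# Train the NB classifier
classifier = GaussianNB().fit(x_train, y_train) 
# Save the classifier model for reuse in future
with open("nb_clf_crude_oil", 'wb') as f:
    pickle.dump(classifier, f)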
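
### For intuition, the idea behind Gaussian Naive Bayes can be sketched in pure Python: per class, estimate a mean and variance for every feature, then pick the class with the highest log prior plus summed Gaussian log-likelihood. This is a simplified illustration, not scikit-learn's implementation; the function names, the fixed eps variance floor and the toy data are assumptions

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y, eps=1e-3):
    """Estimate log prior, feature means and feature variances per class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    n_features = len(X[0])
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        means = [sum(r[j] for r in rows) / n for j in range(n_features)]
        # eps keeps variances away from zero (a simplification of variance smoothing)
        variances = [sum((r[j] - means[j]) ** 2 for r in rows) / n + eps
                     for j in range(n_features)]
        model[label] = (math.log(n / len(X)), means, variances)
    return model


def predict(model, row):
    def log_posterior(params):
        log_prior, means, variances = params
        log_likelihood = sum(
            -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
            for x, m, v in zip(row, means, variances))
        return log_prior + log_likelihood
    return max(model, key=lambda label: log_posterior(model[label]))


# Toy two-feature data standing in for TF-IDF rows
X = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.7]]
y = ["positive", "positive", "negative", "negative"]
model = fit_gaussian_nb(X, y)
print(predict(model, [0.85, 0.15]))
print(predict(model, [0.1, 0.8]))
```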

### Now let's test the model

In [711]:
x_test=test.iloc[:,0:news_data_tfi.shape[1]]
y_test=test.iloc[:,-1]

In [712]:
# Use the model to make a sentiment prediction 
y_pred=classifier.predict(x_test)

### Predicted sentiment

In [713]:
# These are the sentiment labels predicted by the model
y_prediction=pd.DataFrame(y_pred)
y_prediction.rename(columns={0:"Sentiment_Classification"},inplace=True)
y_prediction

Unnamed: 0,Sentiment_Classification
0,positive
1,negative
2,positive
3,positive
4,negative
5,positive
6,positive
7,negative
8,negative
9,positive


### These are the actual sentiment labels of the test data

In [714]:
# This is the actual sentiment which was calculated using Vader
y_testing=pd.DataFrame(y_test)
y_testing.reset_index(inplace=True, drop=True)
y_testing

Unnamed: 0,Sentiment_Classification
0,positive
1,positive
2,negative
3,positive
4,negative
5,positive
6,positive
7,negative
8,negative
9,positive


### Transform the sentiments into a +1 value for positive and -1 value for negative to calculate model accuracy (neutral articles are not mapped and keep a missing value)
### Then convert the value columns to numpy arrays

In [715]:
y_prediction.loc[y_prediction['Sentiment_Classification']=='positive', 'value'] = 1
y_prediction.loc[y_prediction['Sentiment_Classification']=='negative','value']= -1

y_testing.loc[y_testing['Sentiment_Classification']=='positive', 'value'] = 1
y_testing.loc[y_testing['Sentiment_Classification']=='negative','value']= -1

In [716]:
error_test= np.asanyarray(y_testing[['value']]) 
error_pred = np.asanyarray(y_prediction[['value']])

In [717]:
#Evaluating the model accuracy
from sklearn import metrics 
import matplotlib.pyplot as plt 
print("Gaussian Naive Bayes Accuracy: ", metrics.accuracy_score(error_test, error_pred))

Gaussian Naive Bayes Accuracy:  0.8461538461538461
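
### Accuracy is simply the share of test articles whose predicted label equals the actual label, so it can also be computed on the string labels directly, which keeps neutral articles in the comparison. A minimal sketch with made-up labels (not the notebook's results); accuracy and confusion_counts are hypothetical helpers

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label matches the actual one."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


def confusion_counts(y_true, y_pred):
    """Count each (actual, predicted) pair to see which classes get confused."""
    return Counter(zip(y_true, y_pred))


# Made-up labels for illustration
y_true = ["positive", "negative", "neutral", "positive"]
y_pred = ["positive", "positive", "neutral", "positive"]
print(accuracy(y_true, y_pred))
print(confusion_counts(y_true, y_pred))
```

### metrics.accuracy_score also accepts string labels, so the +1 / -1 mapping is not strictly required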


### The trained model gives an accuracy of 84.6% on the test data
### This suggests that it can reproduce the VADER sentiment labels of crude oil articles reasonably well, although the test set is small and the score will vary with the random split
### However, the model could have the following drawback: the Vader library may be incapable of evaluating certain finance-specific sentiments (e.g. bullish, bearish)
### To solve this problem, a lot more emphasis needs to be put on the data collection
### A possible way of obtaining data would be to send out surveys containing statements or phrases and ask finance professionals to give a score in a range of -5 to +5 (-5 being most negative and +5 being most positive)
### This data could then be used to develop the model and would likely provide more accurate insights
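
### As a sketch of how such survey data might be used, the ratings for each term could be averaged and rescaled from the survey's -5 to +5 range to the -4 to +4 range of VADER's lexicon. The survey values below are invented and build_lexicon is a hypothetical helper

```python
from statistics import mean


def build_lexicon(ratings, in_scale=5.0, out_scale=4.0):
    """Average the ratings per term and rescale them to the lexicon's range."""
    return {term: mean(scores) * out_scale / in_scale
            for term, scores in ratings.items()}


# Invented survey responses: term -> scores given by different respondents
survey = {
    "bullish": [4, 5, 3],
    "bearish": [-4, -5, -3],
    "glut": [-2, -3, -1],
}
finance_lexicon = build_lexicon(survey)
print(finance_lexicon)
```

### Assuming the analyzer exposes its word scores as the lexicon dictionary, the result could then be merged with news_sentiment.lexicon.update(finance_lexicon) before scoring the articles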