<a href="https://colab.research.google.com/github/Nagaraj-gt/fp1-stock-value-forecastor/blob/main/Stock_Predictions_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**SENTIMENT ANALYSIS OF COMPANY NEWS TO PREDICT STOCK VALUE**
---

# Team Number : 4

# Team Members : Ashwini, Madhav, Mrinal, Nagaraj, Sudhanshu

## Remark<div class='tocSkip'/>

The stock value prediction is performed for Apple Inc. The data comes from legitimate news sources:

1. Twitter API
2. Polygon Stock News API

The method used is sentiment analysis of the news documents, which captures their relative impact on price movements.

## Setup<div class='tocSkip'/>

This step is done first to set up the notebook to run in dual mode: on Colab or locally.

In [2]:
import sys, os
ON_COLAB = 'google.colab' in sys.modules

if ON_COLAB:
    GIT_ROOT = 'https://github.com/Nagaraj-gt/fp1-stock-value-forecastor/raw/main'
    os.system(f'wget {GIT_ROOT}/code/setup.py')

%run -i setup.py

Preparing notebook to run on Colab
Files will be downloaded to "/content".
Downloading required files ...
!wget -P /content https://github.com/Nagaraj-gt/fp1-stock-value-forecastor/raw/main/settings.py
!wget -P /content/data https://github.com/Nagaraj-gt/fp1-stock-value-forecastor/raw/main/data/apple_news.csv
!wget -P /content/code https://github.com/Nagaraj-gt/fp1-stock-value-forecastor/raw/main/code/requirements.txt
!wget -P /content/packages https://github.com/Nagaraj-gt/fp1-stock-value-forecastor/raw/main/packages/preparation.py

Additional setup for torch, nltk and requirements
!pip install torch==1.7.0+cu101 torchvision==0.8.1+cu101 torchaudio==0.7.0 -f https://download.pytorch.org/whl/torch_stable.html
!pip install -r code/requirements.txt
!python -m nltk.downloader opinion_lexicon punkt stopwords averaged_perceptron_tagger wordnet
!python -m spacy download en


## Load Python Settings<div class="tocSkip"/>

Common imports, defaults for formatting in Matplotlib, Pandas etc.

In [3]:
# path to import packages
sys.path.append(BASE_DIR + '/packages')

import pandas as pd
import numpy as np
from sklearn import preprocessing
import nltk
nltk.download('opinion_lexicon')


[nltk_data] Downloading package opinion_lexicon to /root/nltk_data...
[nltk_data]   Package opinion_lexicon is already up-to-date!


True

# Sentiment Analysis

# Loading News Dataset for a given Ticker - Google (GOOGL)

In [4]:
## Initiation of values
from datetime import datetime, timedelta
import requests

given_ticker = "GOOGL"
from_date = datetime.today() - timedelta(days=365)
api_token = "CRmdYzOoLTkV2JlMpSWl0WSPuEXzuZvs"
ticker_news_api = "https://api.polygon.io/v2/reference/news"
ticker_stocks_api = "https://api.polygon.io/v2/aggs/ticker/"
limit = 1000

def get_response(url=ticker_news_api, params=None, headers=None):
  """Send a GET request and return the "results" list, or an empty list on failure."""

  print("URL is", url)
  # Default to an empty list so a non-200 response does not raise a NameError
  news_list = []

  response = requests.get(url, params=params, headers=headers)

  print("Response Received : ", response)

  if response.status_code == 200:
    output_json = response.json()
    news_list = output_json.get("results", [])

  return news_list



In [5]:


news_output = get_response (
    
    ticker_news_api,

    params = {
      "apiKey" : api_token, 
      "published_utc.gt": from_date.date(),
      "ticker" : given_ticker,
      "limit": limit
     },

    headers = {"Accept": "application/json"}
)


stock_news_df = pd.DataFrame(news_output)

stock_news_df.to_csv("Google_Stock_News.csv")

0



URL is https://api.polygon.io/v2/reference/news
Response Received :  <Response [200]>


0
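
The request above passes the API key as a literal string. A minimal sketch of building the same query parameters while reading the key from the environment is shown here. The variable name `POLYGON_API_KEY` and the helper `build_news_params` are assumptions made for illustration and are not part of the original notebook.

```python
import os
from datetime import date, timedelta

def build_news_params(ticker, days_back=365, limit=1000, env_var="POLYGON_API_KEY"):
    """Build the news query parameters without hard-coding the API key.

    `env_var` is a hypothetical environment variable name.
    """
    api_key = os.environ.get(env_var)
    if not api_key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    start = date.today() - timedelta(days=days_back)
    return {
        "apiKey": api_key,
        "published_utc.gt": start.isoformat(),
        "ticker": ticker,
        "limit": limit,
    }
```

The returned dictionary can be passed as `params` to `get_response`.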

# Cleaning Data

In [6]:
from preparation import *
stock_news_df["title"].apply(clean)
## stock_news_df["description"].apply(clean)

0      Surprise! Warren Buffett "Owns" All 5 FAANG St...
1      How companies take advantage of our very human...
2      Gene Munster's Q2 Earnings Review With Benzing...
3      Sensible stock investors put their money on a ...
4      Mark Cuban says buying metaverse real estate i...
                             ...                        
995    Thinking Of Buying A Foldable Smartphone? Hold...
996                 Why Alphabet Stock Fell 18% in April
997    Grid Dynamics (GDYN) to Report Q1 Earnings: Wh...
998    These 3 Stocks Could Be Next on Warren Buffett...
999    4 Sector ETFs That Survived the Market Rout in...
Name: title, Length: 1000, dtype: object

# Sentiment Analysis using Lexicon based approaches

## Bing Liu Lexicon

In [7]:
from nltk.corpus import opinion_lexicon
from nltk.tokenize import word_tokenize

print('Total number of words in opinion lexicon', len(opinion_lexicon.words()))
print('Examples of positive words in opinion lexicon',
      opinion_lexicon.positive()[:5])
print('Examples of negative words in opinion lexicon',
      opinion_lexicon.negative()[:5])

Total number of words in opinion lexicon 6789
Examples of positive words in opinion lexicon ['a+', 'abound', 'abounds', 'abundance', 'abundant']
Examples of negative words in opinion lexicon ['2-faced', '2-faces', 'abnormal', 'abolish', 'abominable']


In [8]:
# Create a dictionary that is used to score the news text
# word_tokenize needs the punkt tokenizer models; the download is skipped if they are already present
nltk.download('punkt')

pos_score = 1
neg_score = -1
word_dict = {}

# Adding the positive words to the dictionary
for word in opinion_lexicon.positive():
        word_dict[word] = pos_score
        
# Adding the negative words to the dictionary
for word in opinion_lexicon.negative():
        word_dict[word] = neg_score
        
def bing_liu_score(text):
    sentiment_score = 0
    bag_of_words = word_tokenize(text.lower())
    # Guard against empty text to avoid a division by zero
    if not bag_of_words:
        return 0.0
    for word in bag_of_words:
        if word in word_dict:
            sentiment_score += word_dict[word]
    return sentiment_score / len(bag_of_words)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [9]:
stock_news_df['Bing_Liu_Score'] = stock_news_df['title'].apply(bing_liu_score)
stock_news_df[['published_utc','title','Bing_Liu_Score']].sample(20, random_state=0)

Unnamed: 0,published_utc,title,Bing_Liu_Score
993,2022-05-04T12:05:00Z,3 Growth Stocks You Can Buy and Hold for the N...,0.0
859,2022-05-18T14:09:00Z,Fox (FOXA) News Digital Tops Brands in Multipl...,0.090909
298,2022-07-17T11:15:00Z,"1 Green Flag for Marqeta in 2022, and 1 Red Flag",0.0
553,2022-06-20T12:15:00Z,My Best FAANG Stock to Buy Now and Hold Forever,0.1
672,2022-06-06T22:02:00Z,Markets Cooler on a Slower News Day,-0.142857
971,2022-05-06T08:46:00Z,Zacks Investment Ideas feature highlights: Ama...,0.0
27,2022-08-09T11:51:00Z,Investors Like the News From These Companies,0.142857
231,2022-07-22T22:34:00Z,Google Stock Is A Steal,-0.2
306,2022-07-15T17:47:00Z,Is Alphabet A Buy After 20-For-1 Stock Split?,-0.111111
706,2022-06-02T15:36:00Z,Elon Musk wants Tesla staff to return to offic...,0.0
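
The scores in the sample can be reproduced by hand. The sketch below applies the same formula (sum of word polarities divided by the token count) with a two-word toy lexicon and whitespace splitting. Both are simplifications, since the notebook uses the full NLTK opinion lexicon and `word_tokenize`. It assumes that `steal` and `best` are the only lexicon hits in the two headlines, which is consistent with the scores shown in the sample.

```python
# Toy lexicon: a stand-in for the full Bing Liu opinion lexicon (assumption)
toy_lexicon = {"steal": -1, "best": 1}

def toy_score(text, lexicon=toy_lexicon):
    """Sum of word polarities divided by the number of tokens."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return sum(lexicon.get(tok, 0) for tok in tokens) / len(tokens)

steal_score = toy_score("Google Stock Is A Steal")
best_score = toy_score("My Best FAANG Stock to Buy Now and Hold Forever")
print(steal_score, best_score)
```

The first headline is positive in meaning but scores negative because the lexicon has no notion of context, which is a general limitation of lexicon-based scoring.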


In [10]:
stock_news_df['Bing_Liu_Score'] = preprocessing.scale(stock_news_df['Bing_Liu_Score'])
stock_news_df['closing_date'] = pd.to_datetime(stock_news_df['published_utc']).dt.date

stock_news_df.groupby('closing_date').agg({'Bing_Liu_Score':'mean'})
stock_news_df[['closing_date','Bing_Liu_Score']].sample(20, random_state=0)
stock_news_df[['closing_date','Bing_Liu_Score']].to_csv("Day_Scores.csv")
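
In the cell above, the result of `groupby(...).agg(...)` is not assigned, so `Day_Scores.csv` holds one row per article. A minimal sketch of standardizing the scores and keeping one mean score per day follows. The three rows are made-up values, and the manual z-score uses the population standard deviation (`ddof=0`), which is what `sklearn.preprocessing.scale` computes.

```python
import pandas as pd

# Made-up article scores for illustration only
scores = pd.DataFrame({
    "closing_date": ["2022-08-08", "2022-08-08", "2022-08-09"],
    "Bing_Liu_Score": [0.2, -0.1, 0.0],
})

# Standardize: subtract the mean, divide by the population standard deviation
col = scores["Bing_Liu_Score"]
scores["scaled"] = (col - col.mean()) / col.std(ddof=0)

# Keep the aggregated result: one mean score per calendar day
daily_scores = (
    scores.groupby("closing_date", as_index=False)
          .agg(mean_score=("scaled", "mean"))
)
print(daily_scores)
```

Writing `daily_scores` to disk instead of the per-article frame would give a file with one row per day.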

# Preparing label for supervised Learning

The label is derived from the daily stock prices: a day is marked bullish (1) when the closing price is above the opening price, and 0 otherwise. This indicates how the day's news affected the stock price by the end of the day.

In [11]:
# Get Stock prices

stock_output = get_response (
    
    ticker_stocks_api + given_ticker + "/range/1/day/" + from_date.strftime("%Y-%m-%d") +"/" + datetime.today().strftime("%Y-%m-%d"),

    params = {
      "apiKey" : api_token, 
      "limit": limit
     },

    headers = {"Accept": "application/json"}
)

stock_price_df = pd.DataFrame(stock_output)

stock_price_df.to_csv("Google_Stock_Price.csv")

stock_price_df["Date"] = pd.to_datetime(stock_price_df['t'],unit='ms').dt.date
stock_price_df["Closing_Price"] = stock_price_df["c"]
stock_price_df["bullish"] = np.where(stock_price_df['o'] < stock_price_df['c'], 1, 0)
stock_price_df_r = stock_price_df[["Date","o","c","bullish"]]
stock_price_df_r

URL is https://api.polygon.io/v2/aggs/ticker/GOOGL/range/1/day/2021-08-12/2022-08-12
Response Received :  <Response [200]>


Unnamed: 0,Date,o,c,bullish
0,2021-08-12,135.9755,137.1940,1
1,2021-08-13,137.2500,137.7275,1
2,2021-08-16,137.5249,138.3095,1
3,2021-08-17,137.7500,136.6615,0
4,2021-08-18,136.5000,135.4490,0
...,...,...,...,...
247,2022-08-05,116.2300,117.4700,1
248,2022-08-08,118.3900,117.3000,0
249,2022-08-09,117.1350,116.6300,0
250,2022-08-10,118.7800,119.7000,1
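
The `bullish` column is 1 when the close is above the open and 0 otherwise. The rule can be checked against the rows printed above with plain Python. The tuples below are copied from that output, and `is_bullish` is a helper name introduced only for this illustration.

```python
# (date, open, close) copied from the price table in the output
rows = [
    ("2021-08-12", 135.9755, 137.1940),
    ("2021-08-17", 137.7500, 136.6615),
    ("2022-08-10", 118.7800, 119.7000),
]

def is_bullish(open_price, close_price):
    """Return 1 if the day closed above its open, else 0."""
    return 1 if open_price < close_price else 0

labels = {day: is_bullish(o, c) for day, o, c in rows}
print(labels)
```

A day that closes exactly at its open is labelled 0 under this rule.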


In [12]:
## Labelling the news articles

## Preparing Stock News Data for merge

stock_news_df['Date'] =  pd.to_datetime(stock_news_df['published_utc']).dt.date

stock_news_df_r = stock_news_df[['Date','title']]

stock_news_grouped = stock_news_df_r.groupby(['Date'])['title'].apply(' '.join)

stock_news_df_r = pd.DataFrame({'Date':stock_news_grouped.index , 'News': stock_news_grouped.values })

Labelled_Stock_News = pd.merge(stock_news_df_r,stock_price_df_r )

Labelled_Stock_News.to_csv("Labelled_Stock_News.csv")


In [54]:
stock_news_df_r.head(5)

Unnamed: 0,Date,News
0,2022-05-03,Why Alphabet Stock Fell 18% in April Grid Dyna...
1,2022-05-04,"AMD Beats on Q1 Earnings & Revenues, Provides ..."
2,2022-05-05,3 Reasons Investors Shouldn't Lose Sleep Over ...
3,2022-05-06,Defense Wins Ballgames; 3 Stocks That Will Bul...
4,2022-05-07,Apple has spent decades building its walled ga...
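
`pd.merge` without a `how` argument performs an inner join on the shared `Date` column, so days that have news but no price row (weekends and market holidays) are dropped from the labelled set. A small sketch with made-up headlines and prices, assuming that 2022-05-07 (a Saturday) has no trading data:

```python
import pandas as pd

# Made-up rows for illustration only
news = pd.DataFrame({
    "Date": ["2022-05-06", "2022-05-07"],
    "News": ["headline A headline B", "headline C"],
})
prices = pd.DataFrame({
    "Date": ["2022-05-06"],
    "o": [100.0],
    "c": [101.5],
    "bullish": [1],
})

# Inner join (the default): only dates present in both frames survive
labelled = pd.merge(news, prices, on="Date")

# A left join with indicator=True shows which news days found no price row
audit = pd.merge(news, prices, on="Date", how="left", indicator=True)
unmatched = audit.loc[audit["_merge"] == "left_only", "Date"].tolist()
print(labelled.shape, unmatched)
```

Checking the unmatched dates is a quick way to see how many news days are discarded before training.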


# Supervised Learning Approaches

## Preparing data for a supervised learning approach

In [13]:
pd.set_option('display.max_rows', None)  ###
pd.set_option('display.max_columns', None)  ###
pd.set_option('display.width', None)  ###
pd.set_option('display.max_colwidth', None)  ###

# Blueprint: Vectorizing text data and applying a supervised machine learning algorithm

## Step 1 - Data Preparation

In [14]:
from preparation import clean
Labelled_Stock_News['News_orig'] = Labelled_Stock_News['News'].copy()
Labelled_Stock_News['News'] = Labelled_Stock_News['News'].apply(clean)

In [15]:

# This can take longer to run due to the size of the dataset!
import textacy
import spacy
from spacy.lang.en import STOP_WORDS as stop_words
nlp = spacy.load('en')

def extract_lemmas(doc, **kwargs):
    return [t.lemma_ for t in textacy.extract.words(doc,
                                                    filter_stops = False,
                                                    filter_punct = True,
                                                    filter_nums = True,
                                                    include_pos = ['ADJ', 'NOUN', 'VERB', 'ADV'],
                                                    exclude_pos = None,
                                                    min_freq = 1)]

def clean_text(text):
    doc = nlp(text)
    lemmas = extract_lemmas(doc)
    return ' '.join(lemmas)

In [16]:
# Alternate method that uses Wordnet POS tags instead of spaCy - can run faster with similar accuracy
# Tokenization and Lemmatization using wordnet. Re-uses parts of blueprint from Chapter 4
# Uses wordnet POS tags instead of spaCy
# return the wordnet object value corresponding to the POS tag
from nltk.corpus import wordnet

def get_wordnet_pos(pos_tag):
    if pos_tag.startswith('J'):
        return wordnet.ADJ
    elif pos_tag.startswith('V'):
        return wordnet.VERB
    elif pos_tag.startswith('N'):
        return wordnet.NOUN
    elif pos_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN
    
import string
from nltk import pos_tag
from nltk.corpus import stopwords
from nltk.tokenize import WhitespaceTokenizer
from nltk.stem import WordNetLemmatizer
nltk.download('averaged_perceptron_tagger')
nltk.download('omw-1.4')

def clean_text(text):
    # lower text
    text = text.lower()
    # tokenize text and remove punctuation
    text = [word.strip(string.punctuation) for word in text.split(" ")]
    # remove words that contain numbers
    text = [word for word in text if not any(c.isdigit() for c in word)]
    # remove stop words
    stop = stopwords.words('english')
    text = [x for x in text if x not in stop]
    # remove empty tokens
    text = [t for t in text if len(t) > 0]
    # pos tag text
    pos_tags = pos_tag(text)
    # lemmatize text
    text = [WordNetLemmatizer().lemmatize(t[0], get_wordnet_pos(t[1])) for t in pos_tags]
    # remove words with only one letter
    text = [t for t in text if len(t) > 1]
    # join all
    text = " ".join(text)
    return(text)

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


In [17]:
Labelled_Stock_News["News"] = Labelled_Stock_News["News"].apply(clean_text)

## Remove observations that are empty after the cleaning step
Labelled_Stock_News = Labelled_Stock_News[Labelled_Stock_News['News'].str.len() != 0]

## Step 2 - Train-Test Split

In [18]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(Labelled_Stock_News['News'],
                                                    Labelled_Stock_News['bullish'],
                                                    test_size=0.2,
                                                    random_state=42,
                                                    stratify=Labelled_Stock_News['bullish'])

print ('Size of Training Data ', X_train.shape[0])
print ('Size of Test Data ', X_test.shape[0])

print ('Distribution of classes in Training Data :')
print ('Bullish Sentiment ', str(sum(Y_train == 1)/ len(Y_train) * 100.0))
print ('Bearish Sentiment ', str(sum(Y_train == 0)/ len(Y_train) * 100.0))

print ('Distribution of classes in Testing Data :')
print ('Bullish Sentiment ', str(sum(Y_test == 1)/ len(Y_test) * 100.0))
print ('Bearish Sentiment ', str(sum(Y_test == 0)/ len(Y_test) * 100.0))

Size of Training Data  56
Size of Test Data  14
Distribution of classes in Training Data :
Bullish Sentiment  58.92857142857143
Bearish Sentiment  41.07142857142857
Distribution of classes in Testing Data :
Bullish Sentiment  57.14285714285714
Bearish Sentiment  42.857142857142854
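
The printed percentages translate into small absolute counts, which is worth keeping in mind when reading the accuracy figures later. The sketch below recovers the counts from the output above and computes the accuracy of a trivial classifier that always predicts the majority class. This majority baseline is an addition for context, not something the notebook computes.

```python
# Sizes and bullish shares copied from the output above
train_size, test_size = 56, 14
train_bullish_share = 58.92857142857143 / 100
test_bullish_share = 57.14285714285714 / 100

train_bullish = round(train_bullish_share * train_size)
test_bullish = round(test_bullish_share * test_size)

# Always predicting the larger class gets this accuracy on the test set
majority_accuracy = max(test_bullish, test_size - test_bullish) / test_size

print(train_bullish, test_bullish, majority_accuracy)
```

With 14 test days, a single prediction changes the accuracy by about 7 percentage points.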


## Step 3 - Text Vectorization

In [19]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(min_df = 10, ngram_range=(1,1))
X_train_tf = tfidf.fit_transform(X_train)
X_test_tf = tfidf.transform(X_test)
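
`min_df = 10` keeps only terms that occur in at least 10 training documents, which with 56 training days removes most rare words. The document-frequency filter can be sketched in plain Python as follows. The toy corpus and the lower threshold of 2 are made up so the effect is visible, and `TfidfVectorizer` additionally applies its own tokenization and TF-IDF weighting.

```python
from collections import Counter

def vocabulary_with_min_df(documents, min_df):
    """Return the sorted terms that appear in at least `min_df` documents."""
    doc_freq = Counter()
    for doc in documents:
        # set() so that a term counts once per document
        doc_freq.update(set(doc.split()))
    return sorted(term for term, df in doc_freq.items() if df >= min_df)

# Made-up corpus for illustration only
toy_docs = [
    "alphabet stock split",
    "alphabet earnings beat",
    "stock market rally",
]
vocab = vocabulary_with_min_df(toy_docs, min_df=2)
print(vocab)
```

On the real data, `len(tfidf.vocabulary_)` after fitting shows how many terms survive the filter.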

## Step 4 - Training the Machine Learning model

In [20]:
from sklearn.svm import LinearSVC

model1 = LinearSVC(random_state=42, tol=1e-5)
model1.fit(X_train_tf, Y_train)

LinearSVC(random_state=42, tol=1e-05)

In [21]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score

Y_pred = model1.predict(X_test_tf)
print ('Accuracy Score - ', accuracy_score(Y_test, Y_pred))
print ('ROC-AUC Score - ', roc_auc_score(Y_test, Y_pred))

Accuracy Score -  0.7142857142857143
ROC-AUC Score -  0.7083333333333334
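
Because `roc_auc_score` is given hard 0/1 predictions here rather than continuous scores, the value equals the mean of the true-positive and true-negative rates (balanced accuracy). The sketch below uses a confusion matrix inferred from the reported numbers, assuming 8 bullish and 6 bearish test days as implied by the split output. The individual cell counts are a reconstruction, not a printed result.

```python
# Inferred confusion matrix (assumption consistent with the reported scores)
tp, fn = 6, 2   # bullish days: predicted bullish / predicted bearish
tn, fp = 4, 2   # bearish days: predicted bearish / predicted bullish

accuracy = (tp + tn) / (tp + fn + tn + fp)
tpr = tp / (tp + fn)
tnr = tn / (tn + fp)

# With a single operating point, the area under the ROC curve is (TPR + TNR) / 2
auc_from_labels = (tpr + tnr) / 2

print(accuracy, auc_from_labels)
```

Passing `model1.decision_function(X_test_tf)` as the score argument would give a ranking-based ROC-AUC instead.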


In [22]:
sample_reviews = Labelled_Stock_News.sample(5, random_state=22)
sample_reviews_tf = tfidf.transform(sample_reviews['News'])
sentiment_predictions = model1.predict(sample_reviews_tf)
sentiment_predictions = pd.DataFrame(data = sentiment_predictions,
                                     index=sample_reviews.index,
                                     columns=['sentiment_prediction'])
sample_reviews = pd.concat([sample_reviews, sentiment_predictions], axis=1)
print ('Some sample reviews with their sentiment - ')
sample_reviews[['News_orig','sentiment_prediction']]

Some sample reviews with their sentiment - 


Unnamed: 0,News_orig,sentiment_prediction
10,"No College Degree? Apple, Tesla And More Companies Say 'No Problem' Palantir: Dead Cat Bounce Grubhub is giving NYC office workers free lunch --- but who’s still in the office to get the deal? How To Use S&P 500 Futures To Predict Market Movement Tech Market Sell-Off: 4 Stocks Down Big That Are Long-Term Holds Microsoft Vs. Google: Which Is The Better Buy? Could This Growth Stock Make it Big in the Streaming Market? Google's Pixel Watch Is Reportedly Using Outdated Chip 2 Cryptocurrencies That Could Dwarf Dogecoin Is Alphabet a Buy After Q1 Earnings? Google and PayPal are hiding most of their carbon footprint in their bank accounts",0
67,"The Trade Desk Rockets Higher on Blockbuster Performance -- Is the Stock a Buy? Justice Department expected to file antitrust lawsuit against Google as soon as September: report Are Netflix Subscribers Playing Its Video Games? Here's The Latest Data Why This Investor Thinks Airbnb's Stock Buyback Plan Is a ""Horrible"" Idea Why Outset Medical Is Worth a Look These 18 tech stocks are this earnings season's standouts based on three criteria Investors Like the News From These Companies 3 Tech Stocks That Are Screaming Buys in August Did You Miss What Microsoft Had to Say? Checking In on Caterpillar, Arista Networks, and Pinterest Is Google Cheap? 3 Ways To Value The Company Google Search suffers rare but brief outage Monday night No, You Can't Google That: Search Engine Down For Thousands Of Users On Monday Night",0
48,"Synchronoss (SNCR) Offers Support for Alibaba & Google Cloud Wall Street Says Not to Buy These Stocks, But Here's Why I Am Stock-Split Watch: Is Adobe Next? Amazon (AMZN) Boosts AWS Offerings With Cloud WAN Availability The tourists have fled the energy space. One bank says they may have bolted too soon. 2 Growth Stocks to Buy in a Bear Market If You Had Invested $1,000 in Alphabet in 2010, This Is How Much You Would Have Today Is Alphabet Stock a Buy Now Before the 20-for-1 Stock Split? Apple price target cut to $175 at Citi, which offers five reasons to keep buying the stock Google Parent To Slow Down Hiring Amid Global Economic Crisis Google: The Old Adage Of Time Vs. Timing Tracking Mario Gabelli's Gabelli Funds 13F Portfolio - Q1 2022 Update",0
22,"3 Stocks To Buy In This Market Madness Why Is Uber (UBER) Down 7.8% Since Last Earnings Report? 4 Metrics Proving Roku's Ready for a Bull Run Why stocks' bounce feels more like 'a typical bear-market rally than the start of something new and prosperous' Separating Signal From Noise in the Metaverse I've never liked Facebook stock -- and these 4 warnings beyond Sheryl Sandberg's departure are why you shouldn't either Is MongoDB Stock a Buy After Crushing Earnings? Apple stock suffers first 'death cross' chart pattern since the pandemic After three years of promises, attempt to regulate tech comes down to a single bill 'Completely F**king Incompetent:' Russian Soldiers Bash Vladimir Putin, Close Aides In Intercepted Audio We've seen this movie before — the biggest tech-stock gains are still ahead of us A guide for new entrepreneurs on marketing your startup with social media Google Searches For Direction As It Narrowly Misses Target GM's Cruise Becomes California's First Driverless Ride Service",0
55,"Google Stock Is A Steal Is Alphabet (GOOGL) a Buy Heading into Q2 Earnings Announcement? S&P 500 Trims Weekly Gain Following Better-Than-Feared Earnings Reports Previewing Big Tech Earnings Ahead of a Huge Week for Wall Street An Inside Look at the Rule Breakers Investing Mindset Expert Ratings for Alphabet Mizuho Slashes Google's Price Target By 14% To Reflect Headwinds For Its Cloud, Advertising Businesses Snap's 'grim outlook' sends stock on near-40% skid, slaps other social names Snap Stock Crash Is an Opportunity, but Not One You Might Think Tech Getting Hit After Snap's Earnings, Oil Price Dropping And Other Top Headlines For July 22 Here's the Next Stock-Split Stock to Buy After Alphabet Google: Buy The Split 5 Big Tech Stocks Report Earnings Next Week: Here’s What To Expect UBS isn't yet a believer in the rally. Here's its advice on what to do now. 2 Stock-Split Stocks to Buy Hand Over Fist and 1 to Avoid Like the Plague",1


In [23]:
def baseline_scorer(text):
    score = bing_liu_score(text)
    if score > 0:
        return 1
    else:
        return 0
    
Y_pred_baseline = X_test.apply(baseline_scorer)
acc_score = accuracy_score(Y_pred_baseline, Y_test)
print (acc_score)

0.35714285714285715


### Saving the trained model and vectorizer for use with the API later

In [24]:
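import os
import pickle

# Create the target directory first; open() raises FileNotFoundError if it is missing
os.makedirs('models', exist_ok=True)
with open('models/stock_news_classification.pickle', 'wb') as f:
    pickle.dump(model1, f)
with open('models/stock_news_vectorizer.pickle', 'wb') as f:
    pickle.dump(tfidf, f)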

# Pre-trained Language Models using deep learning

## Deep Learning and Transfer Learning


# Blueprint: using transfer learning technique and a pre-trained language model

In [None]:
# This is an optional step to reduce the size of the data by sampling only 40% of the observations
# It is very useful to conduct a first run using a GPU (on Google Colab)
# A larger number of observations can cause longer runtimes and an automatic shutdown on the Colab free instance
#df = df.sample(frac=0.4, random_state=42)

## Step 1: Loading models and tokenization

In [25]:
from transformers import BertConfig, BertTokenizer, BertForSequenceClassification

config = BertConfig.from_pretrained('bert-base-uncased', finetuning_task='binary')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')

Downloading:   0%|          | 0.00/570 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

In [26]:
# There is a change in behavior of the truncation while calling the encode function. 
# This produces a warning and the behavior will probably change in future
# For now, suppress the warning as described in https://github.com/huggingface/transformers/issues/5397
import warnings; ###
warnings.filterwarnings('ignore'); ###

def get_tokens(text, tokenizer, max_seq_length, add_special_tokens=True):
  input_ids = tokenizer.encode(text, 
                               add_special_tokens=add_special_tokens, 
                               max_length=max_seq_length,
                               pad_to_max_length=True)
  attention_mask = [int(id > 0) for id in input_ids]
  assert len(input_ids) == max_seq_length
  assert len(attention_mask) == max_seq_length
  return (input_ids, attention_mask)
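
The attention mask built in `get_tokens` marks real tokens with 1 and padding with 0, relying on the padding id being 0. The padding and masking logic can be sketched without the tokenizer as follows. `pad_and_mask` is a hypothetical helper, the short id list is made up apart from the 101 and 102 markers seen in the output, and the plain slice used here for truncation is a simplification (the real tokenizer keeps the closing special token when it truncates).

```python
def pad_and_mask(token_ids, max_len, pad_id=0):
    """Truncate or right-pad `token_ids` to `max_len` and build the attention mask."""
    ids = list(token_ids[:max_len])
    ids += [pad_id] * (max_len - len(ids))
    # 1 for real tokens, 0 for padding positions
    mask = [int(i != pad_id) for i in ids]
    return ids, mask

ids, mask = pad_and_mask([101, 8224, 4518, 102], max_len=6)
print(ids)
print(mask)
```

With `max_seq_length` set to 50, a full day of concatenated headlines is cut to its first few dozen tokens, so most of each day's news never reaches the model.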


In [27]:
X_train, X_test, Y_train, Y_test = train_test_split(Labelled_Stock_News['News_orig'],
                                                    Labelled_Stock_News['bullish'],
                                                    test_size=0.2,
                                                    random_state=42,
                                                    stratify=Labelled_Stock_News['bullish'])
X_train_tokens = X_train.apply(get_tokens, args=(tokenizer, 50))
X_test_tokens = X_test.apply(get_tokens, args=(tokenizer, 50))

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


In [28]:
import torch
from torch.utils.data import TensorDataset

input_ids_train = torch.tensor(
    [features[0] for features in X_train_tokens.values], dtype=torch.long)
input_mask_train = torch.tensor(
    [features[1] for features in X_train_tokens.values], dtype=torch.long)
label_ids_train = torch.tensor(Y_train.values, dtype=torch.long)

print (input_ids_train.shape)
print (input_mask_train.shape)
print (label_ids_train.shape)

torch.Size([56, 50])
torch.Size([56, 50])
torch.Size([56])


In [29]:
input_ids_train[2]

tensor([  101, 12440,  1006, 27571, 23296,  1007, 12154,  2004,  3006, 16510,
         2015,  1024,  2054,  2017,  2323,  2113,  2339,  8066,  2480,  4518,
         7569,  2000,  1037,  4840,  2035,  1011,  2051,  2659,  2651,  8224,
         4518,  3975,  2003,  4844,  1024,  2054,  1005,  1055,  1996,  4518,
        17680,  1029,  1016,  6031, 28305,  2102, 15768,  2000,  4965,   102])

In [30]:
train_dataset = TensorDataset(input_ids_train,input_mask_train,label_ids_train)

In [31]:
input_ids_test = torch.tensor([features[0] for features in X_test_tokens.values], 
                              dtype=torch.long)
input_mask_test = torch.tensor([features[1] for features in X_test_tokens.values], 
                               dtype=torch.long)
label_ids_test = torch.tensor(Y_test.values, 
                              dtype=torch.long)
test_dataset = TensorDataset(input_ids_test, input_mask_test, label_ids_test)

## Step 2 - Model Training

In [32]:
from torch.utils.data import DataLoader, RandomSampler

train_batch_size = 64
num_train_epochs = 2

train_sampler = RandomSampler(train_dataset)
train_dataloader = DataLoader(train_dataset, 
                              sampler=train_sampler, 
                              batch_size=train_batch_size)
t_total = len(train_dataloader) * num_train_epochs

print ("Num training examples = ", len(train_dataset))
print ("Train batch size  = ", train_batch_size)
print ("Num training steps in an epoch = ", len(train_dataloader))
print ("Num Epochs = ", num_train_epochs)
print ("Total num training steps = ", t_total)

Num training examples =  56
Train batch size  =  64
Num training steps in an epoch =  1
Num Epochs =  2
Total num training steps =  2


In [33]:
from transformers import AdamW, get_linear_schedule_with_warmup

learning_rate = 1e-4
adam_epsilon = 1e-8
warmup_steps = 0

optimizer = AdamW(model.parameters(), lr=learning_rate, eps=adam_epsilon)
scheduler = get_linear_schedule_with_warmup(optimizer, 
                                            num_warmup_steps=warmup_steps, 
                                            num_training_steps=t_total)
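
With 56 examples and a batch size of 64, each epoch is a single optimizer step, so the whole run makes only two updates. The sketch below reproduces that arithmetic and the learning rate that a linear decay schedule without warmup would apply at each step. The decay formula is written out by hand as an assumption about how `get_linear_schedule_with_warmup` behaves when `num_warmup_steps=0`.

```python
import math

num_examples, batch_size, num_epochs = 56, 64, 2
base_lr = 1e-4

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

def linear_decay_lr(step, total, lr):
    """Learning rate that falls linearly from lr at step 0 to 0 at step total."""
    return lr * max(0.0, (total - step) / total)

lrs = [linear_decay_lr(s, total_steps, base_lr) for s in range(total_steps)]
print(steps_per_epoch, total_steps, lrs)
```

A smaller batch size or more epochs would give the optimizer more than two updates to adapt the pre-trained weights.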

In [34]:
from tqdm import trange, notebook

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
train_iterator = trange(num_train_epochs, desc="Epoch")

## Put model in 'train' mode
model.train()
    
for epoch in train_iterator:
    epoch_iterator = notebook.tqdm(train_dataloader, desc="Iteration")
    for step, batch in enumerate(epoch_iterator):

        ## Reset all gradients at start of every iteration
        model.zero_grad()
        
        ## Put the model and the input observations to GPU
        model.to(device)
        batch = tuple(t.to(device) for t in batch)
        
        ## Identify the inputs to the model
        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                  'labels':         batch[2]}

        ## Forward Pass through the model. Input -> Model -> Output
        outputs = model(**inputs)

        ## Determine the deviation (loss)
        loss = outputs[0]
        print("\r%f" % loss, end='')

        ## Back-propagate the loss (automatically calculates gradients)
        loss.backward()

        ## Prevent exploding gradients by limiting gradients to 1.0 
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

        ## Update the parameters and learning rate
        optimizer.step()
        scheduler.step()

Epoch:   0%|          | 0/2 [00:00<?, ?it/s]

Iteration:   0%|          | 0/1 [00:00<?, ?it/s]

0.694005

Epoch:  50%|█████     | 1/2 [00:27<00:27, 27.27s/it]

Iteration:   0%|          | 0/1 [00:00<?, ?it/s]

0.715776

Epoch: 100%|██████████| 2/2 [00:52<00:00, 26.45s/it]
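
The two printed losses (0.694 and 0.716) are close to ln 2, which is about 0.693, the cross-entropy of a binary classifier that assigns probability 0.5 to each class. A short check of that reference value, using a helper name chosen for this illustration:

```python
import math

def binary_cross_entropy(p_true_class):
    """Cross-entropy loss for one example given the probability of its true class."""
    return -math.log(p_true_class)

# Loss of a classifier that is undecided between the two classes
chance_loss = binary_cross_entropy(0.5)
print(round(chance_loss, 6))
```

Losses near this value suggest that, after two update steps, the classification head has not yet learned to separate the classes.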


In [None]:
model.save_pretrained('outputs')

## Step 3 - Model Evaluation


In [35]:
import numpy as np
from torch.utils.data import SequentialSampler

test_batch_size = 64
test_sampler = SequentialSampler(test_dataset)
test_dataloader = DataLoader(test_dataset, 
                             sampler=test_sampler, 
                             batch_size=test_batch_size)

# Optionally reload the fine-tuned model that was saved earlier with save_pretrained('outputs')
# model = BertForSequenceClassification.from_pretrained('outputs')

# Initialize the prediction and actual labels
preds = None
out_label_ids = None

## Put model in "eval" mode
model.eval()

for batch in notebook.tqdm(test_dataloader, desc="Evaluating"):
    
    ## Put the model and the input observations to GPU
    model.to(device)
    batch = tuple(t.to(device) for t in batch)
    
    ## Do not track any gradients since in 'eval' mode
    with torch.no_grad():
        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                  'labels':         batch[2]}

        ## Forward pass through the model
        outputs = model(**inputs)

        ## We get loss since we provided the labels
        tmp_eval_loss, logits = outputs[:2]

        ## There may be more than one batch of items in the test dataset
        if preds is None:
            preds = logits.detach().cpu().numpy()
            out_label_ids = inputs['labels'].detach().cpu().numpy()
        else:
            preds = np.append(preds, logits.detach().cpu().numpy(), axis=0)
            out_label_ids = np.append(out_label_ids, 
                                      inputs['labels'].detach().cpu().numpy(), 
                                      axis=0)
    
## Get final loss, predictions and accuracy
preds = np.argmax(preds, axis=1)
acc_score = accuracy_score(preds, out_label_ids)
print ('Accuracy Score on Test data ', acc_score)

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Accuracy Score on Test data  0.5714285714285714


# Closing Remarks

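Two models were trained. The LinearSVC classifier on TF-IDF features reached an accuracy of about 71% (ROC-AUC 0.71) on the 14-day test set, while the fine-tuned BERT base model reached about 57%, which equals the share of bullish days in the test set. The Bing Liu lexicon baseline scored about 36%. With only 70 labelled days in total, these differences should be read with caution.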