# Daily News Stock Market Prediction

---

**Table of Contents** 

[1.0 Objectives](#1.0-Objectives)  
[2.0 Import Library](#2.0-Import-Library)  
[3.0 Set Constant and Default Settings](#3.0-Set-Constant-and-Default-Settings)  
[4.0 Load Dataset](#4.0-Load-Dataset) 

---

# 1.0 Objectives

- To predict stock market movement from daily news headlines

# 2.0 Import Library

In [2]:
# System
import os

# EDA
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [305]:
# Deep learning
import tensorflow as tf

# Preprocessing
from sklearn.model_selection import TimeSeriesSplit
from sklearn.feature_extraction.text import CountVectorizer

# Metrics
from sklearn.metrics import roc_auc_score, plot_roc_curve
from tensorflow.keras.metrics import AUC

# Model
from sklearn.linear_model import LogisticRegression
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Dense

# Optimisation
from tensorflow.keras.callbacks import EarlyStopping

# 3.0 Set Constant and Default Settings

In [4]:
plt.rcParams['figure.dpi'] = 150
sns.set_style('dark')

In [5]:
base_dir = os.path.join('/', 'kaggle', 'input', 'stocknews')

# Check whether the notebook runs on Kaggle or in a local environment
is_kaggle = os.path.exists(base_dir)

dataset_path = os.path.join(base_dir if is_kaggle else '', 'Combined_News_DJIA.csv')

# 4.0 Load Dataset

In [181]:
df = pd.read_csv(dataset_path, parse_dates=['Date'], index_col='Date')

Since the *Date* column is used as the index, we sort the dataset in chronological order.

In [182]:
df = df.sort_index()

# 5.0 Overview of Dataset

In [183]:
df.head()

Date,Label,Top1,Top2,Top3,Top4,Top5,Top6,Top7,Top8,Top9,...,Top16,Top17,Top18,Top19,Top20,Top21,Top22,Top23,Top24,Top25
2008-08-08,0,"b""Georgia 'downs two Russian warplanes' as cou...",b'BREAKING: Musharraf to be impeached.',b'Russia Today: Columns of troops roll into So...,b'Russian tanks are moving towards the capital...,"b""Afghan children raped with 'impunity,' U.N. ...",b'150 Russian tanks have entered South Ossetia...,"b""Breaking: Georgia invades South Ossetia, Rus...","b""The 'enemy combatent' trials are nothing but...",b'Georgian troops retreat from S. Osettain cap...,...,b'Georgia Invades South Ossetia - if Russia ge...,b'Al-Qaeda Faces Islamist Backlash',"b'Condoleezza Rice: ""The US would not act to p...",b'This is a busy day: The European Union has ...,"b""Georgia will withdraw 1,000 soldiers from Ir...",b'Why the Pentagon Thinks Attacking Iran is a ...,b'Caucasus in crisis: Georgia invades South Os...,b'Indian shoe manufactory - And again in a se...,b'Visitors Suffering from Mental Illnesses Ban...,"b""No Help for Mexico's Kidnapping Surge"""
2008-08-11,1,b'Why wont America and Nato help us? If they w...,b'Bush puts foot down on Georgian conflict',"b""Jewish Georgian minister: Thanks to Israeli ...",b'Georgian army flees in disarray as Russians ...,"b""Olympic opening ceremony fireworks 'faked'""",b'What were the Mossad with fraudulent New Zea...,b'Russia angered by Israeli military sale to G...,b'An American citizen living in S.Ossetia blam...,b'Welcome To World War IV! Now In High Definit...,...,b'Israel and the US behind the Georgian aggres...,"b'""Do not believe TV, neither Russian nor Geor...",b'Riots are still going on in Montreal (Canada...,b'China to overtake US as largest manufacturer',b'War in South Ossetia [PICS]',b'Israeli Physicians Group Condemns State Tort...,b' Russia has just beaten the United States ov...,b'Perhaps *the* question about the Georgia - R...,b'Russia is so much better at war',"b""So this is what it's come to: trading sex fo..."
2008-08-12,0,b'Remember that adorable 9-year-old who sang a...,"b""Russia 'ends Georgia operation'""","b'""If we had no sexual harassment we would hav...","b""Al-Qa'eda is losing support in Iraq because ...",b'Ceasefire in Georgia: Putin Outmaneuvers the...,b'Why Microsoft and Intel tried to kill the XO...,b'Stratfor: The Russo-Georgian War and the Bal...,"b""I'm Trying to Get a Sense of This Whole Geor...","b""The US military was surprised by the timing ...",...,b'U.S. troops still in Georgia (did you know t...,b'Why Russias response to Georgia was right',"b'Gorbachev accuses U.S. of making a ""serious ...","b'Russia, Georgia, and NATO: Cold War Two'",b'Remember that adorable 62-year-old who led y...,b'War in Georgia: The Israeli connection',b'All signs point to the US encouraging Georgi...,b'Christopher King argues that the US and NATO...,b'America: The New Mexico?',"b""BBC NEWS | Asia-Pacific | Extinction 'by man..."
2008-08-13,0,b' U.S. refuses Israel weapons to attack Iran:...,"b""When the president ordered to attack Tskhinv...",b' Israel clears troops who killed Reuters cam...,b'Britain\'s policy of being tough on drugs is...,b'Body of 14 year old found in trunk; Latest (...,b'China has moved 10 *million* quake survivors...,"b""Bush announces Operation Get All Up In Russi...",b'Russian forces sink Georgian ships ',"b""The commander of a Navy air reconnaissance s...",...,b'Elephants extinct by 2020?',b'US humanitarian missions soon in Georgia - i...,"b""Georgia's DDOS came from US sources""","b'Russian convoy heads into Georgia, violating...",b'Israeli defence minister: US against strike ...,b'Gorbachev: We Had No Choice',b'Witness: Russian forces head towards Tbilisi...,b' Quarter of Russians blame U.S. for conflict...,b'Georgian president says US military will ta...,b'2006: Nobel laureate Aleksander Solzhenitsyn...
2008-08-14,1,b'All the experts admit that we should legalis...,b'War in South Osetia - 89 pictures made by a ...,b'Swedish wrestler Ara Abrahamian throws away ...,b'Russia exaggerated the death toll in South O...,b'Missile That Killed 9 Inside Pakistan May Ha...,"b""Rushdie Condemns Random House's Refusal to P...",b'Poland and US agree to missle defense deal. ...,"b'Will the Russians conquer Tblisi? Bet on it,...",b'Russia exaggerating South Ossetian death tol...,...,b'Bank analyst forecast Georgian crisis 2 days...,"b""Georgia confict could set back Russia's US r...",b'War in the Caucasus is as much the product o...,"b'""Non-media"" photos of South Ossetia/Georgia ...",b'Georgian TV reporter shot by Russian sniper ...,b'Saudi Arabia: Mother moves to block child ma...,b'Taliban wages war on humanitarian aid workers',"b'Russia: World ""can forget about"" Georgia\'s...",b'Darfur rebels accuse Sudan of mounting major...,b'Philippines : Peace Advocate say Muslims nee...


# 6.0 Split dataset into train and test set

Because the dataset is a time series, the usual approach of splitting it randomly into train and test sets is not appropriate. The chronological order of the observations matters: real-life events happen in sequence, and a random split would let the model train on news published after some of the days it is tested on.

The most recent 20% of the dataset is reserved for the test set.

In [184]:
dataset_size = len(df)

train_size_index = int(dataset_size * 0.8)

In [204]:
train_df = df[:train_size_index]
test_df = df[train_size_index:]

# 7.0 Exploratory Data Analysis (EDA)

## 7.1 Statistics Summary of Dataset

In [205]:
df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1989 entries, 2008-08-08 to 2016-07-01
Data columns (total 26 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Label   1989 non-null   int64 
 1   Top1    1989 non-null   object
 2   Top2    1989 non-null   object
 3   Top3    1989 non-null   object
 4   Top4    1989 non-null   object
 5   Top5    1989 non-null   object
 6   Top6    1989 non-null   object
 7   Top7    1989 non-null   object
 8   Top8    1989 non-null   object
 9   Top9    1989 non-null   object
 10  Top10   1989 non-null   object
 11  Top11   1989 non-null   object
 12  Top12   1989 non-null   object
 13  Top13   1989 non-null   object
 14  Top14   1989 non-null   object
 15  Top15   1989 non-null   object
 16  Top16   1989 non-null   object
 17  Top17   1989 non-null   object
 18  Top18   1989 non-null   object
 19  Top19   1989 non-null   object
 20  Top20   1989 non-null   object
 21  Top21   1989 non-null   object
 22  Top22 

In [206]:
df.columns

Index(['Label', 'Top1', 'Top2', 'Top3', 'Top4', 'Top5', 'Top6', 'Top7', 'Top8',
       'Top9', 'Top10', 'Top11', 'Top12', 'Top13', 'Top14', 'Top15', 'Top16',
       'Top17', 'Top18', 'Top19', 'Top20', 'Top21', 'Top22', 'Top23', 'Top24',
       'Top25'],
      dtype='object')

In [207]:
label_weight_perc = train_df['Label'].value_counts(normalize=True) * 100

label_weight_perc

1    54.242615
0    45.757385
Name: Label, dtype: float64

The *Label* column is the target for training the model. The two classes are fairly balanced: about 54% of the training days are labelled 1 and about 46% are labelled 0.

## 7.2 Data Cleaning

In [208]:
train_df.isnull().sum()

Label    0
Top1     0
Top2     0
Top3     0
Top4     0
Top5     0
Top6     0
Top7     0
Top8     0
Top9     0
Top10    0
Top11    0
Top12    0
Top13    0
Top14    0
Top15    0
Top16    0
Top17    0
Top18    0
Top19    0
Top20    0
Top21    0
Top22    0
Top23    1
Top24    3
Top25    3
dtype: int64

The features *Top23*, *Top24* and *Top25* contain missing values. These rows do not need to be dropped, because some days simply have fewer than 25 top headlines. We fill the missing values with an empty string.

In [209]:
train_df = train_df.fillna('')

In [210]:
train_df.head()

Date,Label,Top1,Top2,Top3,Top4,Top5,Top6,Top7,Top8,Top9,...,Top16,Top17,Top18,Top19,Top20,Top21,Top22,Top23,Top24,Top25
2008-08-08,0,"b""Georgia 'downs two Russian warplanes' as cou...",b'BREAKING: Musharraf to be impeached.',b'Russia Today: Columns of troops roll into So...,b'Russian tanks are moving towards the capital...,"b""Afghan children raped with 'impunity,' U.N. ...",b'150 Russian tanks have entered South Ossetia...,"b""Breaking: Georgia invades South Ossetia, Rus...","b""The 'enemy combatent' trials are nothing but...",b'Georgian troops retreat from S. Osettain cap...,...,b'Georgia Invades South Ossetia - if Russia ge...,b'Al-Qaeda Faces Islamist Backlash',"b'Condoleezza Rice: ""The US would not act to p...",b'This is a busy day: The European Union has ...,"b""Georgia will withdraw 1,000 soldiers from Ir...",b'Why the Pentagon Thinks Attacking Iran is a ...,b'Caucasus in crisis: Georgia invades South Os...,b'Indian shoe manufactory - And again in a se...,b'Visitors Suffering from Mental Illnesses Ban...,"b""No Help for Mexico's Kidnapping Surge"""
2008-08-11,1,b'Why wont America and Nato help us? If they w...,b'Bush puts foot down on Georgian conflict',"b""Jewish Georgian minister: Thanks to Israeli ...",b'Georgian army flees in disarray as Russians ...,"b""Olympic opening ceremony fireworks 'faked'""",b'What were the Mossad with fraudulent New Zea...,b'Russia angered by Israeli military sale to G...,b'An American citizen living in S.Ossetia blam...,b'Welcome To World War IV! Now In High Definit...,...,b'Israel and the US behind the Georgian aggres...,"b'""Do not believe TV, neither Russian nor Geor...",b'Riots are still going on in Montreal (Canada...,b'China to overtake US as largest manufacturer',b'War in South Ossetia [PICS]',b'Israeli Physicians Group Condemns State Tort...,b' Russia has just beaten the United States ov...,b'Perhaps *the* question about the Georgia - R...,b'Russia is so much better at war',"b""So this is what it's come to: trading sex fo..."
2008-08-12,0,b'Remember that adorable 9-year-old who sang a...,"b""Russia 'ends Georgia operation'""","b'""If we had no sexual harassment we would hav...","b""Al-Qa'eda is losing support in Iraq because ...",b'Ceasefire in Georgia: Putin Outmaneuvers the...,b'Why Microsoft and Intel tried to kill the XO...,b'Stratfor: The Russo-Georgian War and the Bal...,"b""I'm Trying to Get a Sense of This Whole Geor...","b""The US military was surprised by the timing ...",...,b'U.S. troops still in Georgia (did you know t...,b'Why Russias response to Georgia was right',"b'Gorbachev accuses U.S. of making a ""serious ...","b'Russia, Georgia, and NATO: Cold War Two'",b'Remember that adorable 62-year-old who led y...,b'War in Georgia: The Israeli connection',b'All signs point to the US encouraging Georgi...,b'Christopher King argues that the US and NATO...,b'America: The New Mexico?',"b""BBC NEWS | Asia-Pacific | Extinction 'by man..."
2008-08-13,0,b' U.S. refuses Israel weapons to attack Iran:...,"b""When the president ordered to attack Tskhinv...",b' Israel clears troops who killed Reuters cam...,b'Britain\'s policy of being tough on drugs is...,b'Body of 14 year old found in trunk; Latest (...,b'China has moved 10 *million* quake survivors...,"b""Bush announces Operation Get All Up In Russi...",b'Russian forces sink Georgian ships ',"b""The commander of a Navy air reconnaissance s...",...,b'Elephants extinct by 2020?',b'US humanitarian missions soon in Georgia - i...,"b""Georgia's DDOS came from US sources""","b'Russian convoy heads into Georgia, violating...",b'Israeli defence minister: US against strike ...,b'Gorbachev: We Had No Choice',b'Witness: Russian forces head towards Tbilisi...,b' Quarter of Russians blame U.S. for conflict...,b'Georgian president says US military will ta...,b'2006: Nobel laureate Aleksander Solzhenitsyn...
2008-08-14,1,b'All the experts admit that we should legalis...,b'War in South Osetia - 89 pictures made by a ...,b'Swedish wrestler Ara Abrahamian throws away ...,b'Russia exaggerated the death toll in South O...,b'Missile That Killed 9 Inside Pakistan May Ha...,"b""Rushdie Condemns Random House's Refusal to P...",b'Poland and US agree to missle defense deal. ...,"b'Will the Russians conquer Tblisi? Bet on it,...",b'Russia exaggerating South Ossetian death tol...,...,b'Bank analyst forecast Georgian crisis 2 days...,"b""Georgia confict could set back Russia's US r...",b'War in the Caucasus is as much the product o...,"b'""Non-media"" photos of South Ossetia/Georgia ...",b'Georgian TV reporter shot by Russian sniper ...,b'Saudi Arabia: Mother moves to block child ma...,b'Taliban wages war on humanitarian aid workers',"b'Russia: World ""can forget about"" Georgia\'s...",b'Darfur rebels accuse Sudan of mounting major...,b'Philippines : Peace Advocate say Muslims nee...


# 8.0 Preprocessing

## 8.1 Data Cleaning

In [211]:
type(train_df['Top1'].iloc[0])

str

At first glance, each string starts with `b'...'`, which suggests that the data type is a byte sequence. That is not the case: calling `type()` shows that the values are already of type `str`. The byte-literal notation has been written into the text itself, so we clean the data by removing it.

In [212]:
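def clean_byte_str(df: pd.DataFrame, cols: list) -> pd.DataFrame:
    # Characters stripped from both ends: the "b" prefix and either quote mark.
    # Note: str.strip() removes ANY of these characters from both ends, so a
    # headline that legitimately ends with a quote loses that quote as well.
    strip_ch = "b'\""

    return df[cols].apply(lambda x: x.str.strip(strip_ch), axis=1)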
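
Because `str.strip` treats its argument as a set of characters, it can also remove a quote or a letter "b" that belongs to the headline itself. A more conservative alternative is sketched below. It assumes that each raw value looks like a Python bytes literal (`b'...'` or `b"..."`) and removes only the prefix together with its matching closing quote. The helper name `strip_bytes_repr` is hypothetical and is not part of the notebook.

```python
import re

# Matches b'...' or b"..." and captures the text between the matching quotes.
_BYTES_REPR = re.compile(r"""^b(['"])(.*)\1$""", re.DOTALL)


def strip_bytes_repr(text: str) -> str:
    """Return the text inside a b'...' / b"..." wrapper, or the input unchanged."""
    match = _BYTES_REPR.match(text)
    return match.group(2) if match else text


cleaned_single = strip_bytes_repr("b'BREAKING: Musharraf to be impeached.'")
cleaned_double = strip_bytes_repr("b\"Olympic opening ceremony fireworks 'faked'\"")
cleaned_empty = strip_bytes_repr("")

print(cleaned_single)
print(cleaned_double)
```

Applied column by column, for example with `df[cols].apply(lambda s: s.map(strip_bytes_repr))`, this would keep inner quotes intact and leave values without the wrapper untouched.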

In [213]:
txt_cols = [f'Top{x}' for x in range(1, 26)]

In [214]:
preprocessed_txt_df = clean_byte_str(train_df, txt_cols)

train_df = train_df.drop(txt_cols, axis=1).merge(preprocessed_txt_df, how='outer', left_index=True, right_index=True)

## 8.2 Aggregate Texts Feature Column

We aggregate the columns *Top1* to *Top25* into a single feature, *TopNews*.

In [215]:
train_df.head()

Date,Label,Top1,Top2,Top3,Top4,Top5,Top6,Top7,Top8,Top9,...,Top16,Top17,Top18,Top19,Top20,Top21,Top22,Top23,Top24,Top25
2008-08-08,0,Georgia 'downs two Russian warplanes' as count...,BREAKING: Musharraf to be impeached.,Russia Today: Columns of troops roll into Sout...,Russian tanks are moving towards the capital o...,"Afghan children raped with 'impunity,' U.N. of...",150 Russian tanks have entered South Ossetia w...,"Breaking: Georgia invades South Ossetia, Russi...",The 'enemy combatent' trials are nothing but a...,Georgian troops retreat from S. Osettain capit...,...,Georgia Invades South Ossetia - if Russia gets...,Al-Qaeda Faces Islamist Backlash,"Condoleezza Rice: ""The US would not act to pre...",This is a busy day: The European Union has ap...,"Georgia will withdraw 1,000 soldiers from Iraq...",Why the Pentagon Thinks Attacking Iran is a Ba...,Caucasus in crisis: Georgia invades South Ossetia,Indian shoe manufactory - And again in a seri...,Visitors Suffering from Mental Illnesses Banne...,No Help for Mexico's Kidnapping Surge
2008-08-11,1,Why wont America and Nato help us? If they won...,Bush puts foot down on Georgian conflict,Jewish Georgian minister: Thanks to Israeli tr...,Georgian army flees in disarray as Russians ad...,Olympic opening ceremony fireworks 'faked,What were the Mossad with fraudulent New Zeala...,Russia angered by Israeli military sale to Geo...,An American citizen living in S.Ossetia blames...,Welcome To World War IV! Now In High Definition!,...,Israel and the US behind the Georgian aggression?,"Do not believe TV, neither Russian nor Georgia...",Riots are still going on in Montreal (Canada) ...,China to overtake US as largest manufacturer,War in South Ossetia [PICS],Israeli Physicians Group Condemns State Torture,Russia has just beaten the United States over...,Perhaps *the* question about the Georgia - Rus...,Russia is so much better at war,So this is what it's come to: trading sex for ...
2008-08-12,0,Remember that adorable 9-year-old who sang at ...,Russia 'ends Georgia operation,If we had no sexual harassment we would have n...,Al-Qa'eda is losing support in Iraq because of...,Ceasefire in Georgia: Putin Outmaneuvers the West,Why Microsoft and Intel tried to kill the XO $...,Stratfor: The Russo-Georgian War and the Balan...,I'm Trying to Get a Sense of This Whole Georgi...,The US military was surprised by the timing an...,...,U.S. troops still in Georgia (did you know the...,Why Russias response to Georgia was right,"Gorbachev accuses U.S. of making a ""serious bl...","Russia, Georgia, and NATO: Cold War Two",Remember that adorable 62-year-old who led you...,War in Georgia: The Israeli connection,All signs point to the US encouraging Georgia ...,Christopher King argues that the US and NATO a...,America: The New Mexico?,BBC NEWS | Asia-Pacific | Extinction 'by man n...
2008-08-13,0,U.S. refuses Israel weapons to attack Iran: r...,When the president ordered to attack Tskhinval...,Israel clears troops who killed Reuters camer...,"Britain\'s policy of being tough on drugs is ""...",Body of 14 year old found in trunk; Latest (ra...,China has moved 10 *million* quake survivors i...,Bush announces Operation Get All Up In Russia'...,Russian forces sink Georgian ships,The commander of a Navy air reconnaissance squ...,...,Elephants extinct by 2020?,US humanitarian missions soon in Georgia - if ...,Georgia's DDOS came from US sources,"Russian convoy heads into Georgia, violating t...",Israeli defence minister: US against strike on...,Gorbachev: We Had No Choice,Witness: Russian forces head towards Tbilisi i...,Quarter of Russians blame U.S. for conflict: ...,Georgian president says US military will take...,2006: Nobel laureate Aleksander Solzhenitsyn a...
2008-08-14,1,All the experts admit that we should legalise ...,War in South Osetia - 89 pictures made by a Ru...,Swedish wrestler Ara Abrahamian throws away me...,Russia exaggerated the death toll in South Oss...,Missile That Killed 9 Inside Pakistan May Have...,Rushdie Condemns Random House's Refusal to Pub...,Poland and US agree to missle defense deal. In...,"Will the Russians conquer Tblisi? Bet on it, n...","Russia exaggerating South Ossetian death toll,...",...,Bank analyst forecast Georgian crisis 2 days e...,Georgia confict could set back Russia's US rel...,War in the Caucasus is as much the product of ...,"Non-media"" photos of South Ossetia/Georgia con...",Georgian TV reporter shot by Russian sniper du...,Saudi Arabia: Mother moves to block child marr...,Taliban wages war on humanitarian aid workers,"Russia: World ""can forget about"" Georgia\'s t...",Darfur rebels accuse Sudan of mounting major a...,Philippines : Peace Advocate say Muslims need ...


In [216]:
train_df['TopNews'] = train_df[txt_cols].apply('; '.join, axis=1)

train_df = train_df.drop(txt_cols, axis=1)

In [217]:
train_df.head()

Date,Label,TopNews
2008-08-08,0,Georgia 'downs two Russian warplanes' as count...
2008-08-11,1,Why wont America and Nato help us? If they won...
2008-08-12,0,Remember that adorable 9-year-old who sang at ...
2008-08-13,0,U.S. refuses Israel weapons to attack Iran: r...
2008-08-14,1,All the experts admit that we should legalise ...


In [218]:
train_df.iloc[0]['TopNews']

'Georgia \'downs two Russian warplanes\' as countries move to brink of war; BREAKING: Musharraf to be impeached.; Russia Today: Columns of troops roll into South Ossetia; footage from fighting (YouTube); Russian tanks are moving towards the capital of South Ossetia, which has reportedly been completely destroyed by Georgian artillery fire; Afghan children raped with \'impunity,\' U.N. official says - this is sick, a three year old was raped and they do nothing; 150 Russian tanks have entered South Ossetia whilst Georgia shoots down two Russian jets.; Breaking: Georgia invades South Ossetia, Russia warned it would intervene on SO\'s side; The \'enemy combatent\' trials are nothing but a sham: Salim Haman has been sentenced to 5 1/2 years, but will be kept longer anyway just because they feel like it.; Georgian troops retreat from S. Osettain capital, presumably leaving several hundred people killed. [VIDEO]; Did the U.S. Prep Georgia for War with Russia?; Rice Gives Green Light for Isra

# 9.0 Model Evaluation Metrics

The target is fairly balanced and this is a binary classification problem, so we use the **Area Under the Receiver Operating Characteristic Curve (ROC AUC)** score to evaluate our models.

We also use `TimeSeriesSplit` from the *scikit-learn* library to perform cross-validation on the train dataset. Because the data is time-ordered, each fold should train only on observations that precede its validation block.
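
To make the idea of time-ordered cross-validation concrete, here is a minimal hand-written sketch of an expanding-window split. It is meant to mirror what `TimeSeriesSplit` does with its default settings as I understand them (equally sized validation blocks taken from the end of the data), but the function `expanding_window_splits` is a hypothetical illustration, not the library's implementation.

```python
def expanding_window_splits(n_samples: int, n_splits: int):
    """Yield (train_indices, val_indices) pairs in chronological order."""
    fold_size = n_samples // (n_splits + 1)
    first_val_start = n_samples - n_splits * fold_size
    for val_start in range(first_val_start, n_samples, fold_size):
        # Train on everything before the validation block, never after it.
        train_idx = list(range(val_start))
        val_idx = list(range(val_start, val_start + fold_size))
        yield train_idx, val_idx


splits = list(expanding_window_splits(n_samples=12, n_splits=2))

for train_idx, val_idx in splits:
    print(f"train={train_idx[0]}..{train_idx[-1]}  val={val_idx[0]}..{val_idx[-1]}")
```

With 12 samples and 2 splits, the first fold trains on positions 0 to 3 and validates on 4 to 7, and the second fold trains on 0 to 7 and validates on 8 to 11. The training window grows while the validation block always lies in its future.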

In [254]:
model_eval = {
    'model': [],
    'roc_auc': [],
}

def add_model_eval(model, roc_auc_list: list):
    # Store the mean ROC AUC across the cross-validation folds
    roc_auc = np.mean(roc_auc_list)

    model_eval['model'].append(model)
    model_eval['roc_auc'].append(f'{roc_auc:.2f}')
    
def view_models_eval(sort=False):
    eval_df = pd.DataFrame(model_eval)
    
    if sort:
        eval_df = eval_df.sort_values(by=['roc_auc'], ascending=[False])
    
    # Styler.hide(axis='index') replaces hide_index(), which was removed in pandas 2.0
    display(eval_df.style.hide(axis='index'))

# 10.0 Plan of Model Training

We will use the *TensorFlow 2.0* library to train the deep learning models and apply natural language processing (NLP) techniques to the dataset. There are two major ways of representing text before feeding it into a neural network.

- One-Hot Encoding
  - Each word in the vocabulary is represented by one column of the matrix
  - The resulting matrix is sparse
  - The representation is simple, but it treats every word as unrelated to every other word, which limits how well it works in natural language applications
  - It only captures which word is present (its position in the vocabulary), not its meaning
- Word Vector
  - This is a game changer, as each word is represented by a dense n-dimensional vector
  - It captures not only which word is present but also aspects of the word's meaning
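
The difference between the two representations can be illustrated with a toy vocabulary. In the sketch below, the one-hot vectors are built from the vocabulary, while the dense vectors are hand-made numbers chosen purely for illustration (they are not learned embeddings and do not come from the dataset).

```python
import math

vocab = ["market", "stocks", "war"]

# One-hot: one column per vocabulary word, a single 1 marks the word.
one_hot = {
    word: [1.0 if i == j else 0.0 for j in range(len(vocab))]
    for i, word in enumerate(vocab)
}

# Hypothetical 2-dimensional word vectors (illustrative values only).
dense = {
    "market": [0.9, 0.1],
    "stocks": [0.8, 0.2],
    "war": [0.1, 0.9],
}


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


one_hot_similarity = cosine(one_hot["market"], one_hot["stocks"])
dense_related = cosine(dense["market"], dense["stocks"])
dense_unrelated = cosine(dense["market"], dense["war"])

print(one_hot_similarity, dense_related, dense_unrelated)
```

Any two distinct one-hot vectors are orthogonal, so their similarity is always zero. Dense vectors can place related words close together, which is what allows a model to exploit word meaning.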

In total, we will train 6 different models on the training dataset.

First, we perform one-hot encoding on the text features. After preprocessing the train dataset, we train a logistic regression model, which acts as the benchmark for the later deep learning models. We then train a Deep Neural Network (DNN) and a Long Short-Term Memory (LSTM) network on the same dataset.

Afterwards, we preprocess the words in the dataset into word vectors and train a DNN and an LSTM on that representation.

All deep learning models are trained with an early stopping callback, which halts training once the validation loss stops improving.
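
The early stopping rule can be sketched in plain Python. The function below is a simplified, hypothetical illustration of a patience-based rule: it is not the Keras `EarlyStopping` implementation, and the validation losses are made-up numbers.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stopped_epoch, best_epoch), both counted from 1."""
    best_loss = float("inf")
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # Improvement: remember this epoch and reset the counter.
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch


# Illustrative validation losses that get worse after the first epoch.
val_losses = [0.70, 0.74, 0.81, 0.90, 1.02, 1.15, 1.30]
stopped_epoch, best_epoch = early_stopping_epoch(val_losses, patience=5)

print(stopped_epoch, best_epoch)
```

With a patience of 5, training that never improves on its first epoch stops after epoch 6. Restoring the best weights then returns the model to its state at the best epoch.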

Lastly, we will use a state-of-the-art natural language processing model for this sentiment analysis task. By taking advantage of a pre-trained model that has been trained on a huge dataset, we can fine-tune it for our specific task.

# 11.0 Deep Learning Model

## 11.1 One-Hot Encoding

### 11.1.1 Benchmark Model - Logistic Regression

In [255]:
X_train = train_df['TopNews']
y_train = train_df['Label']

In [256]:
scores = []

tscv = TimeSeriesSplit(n_splits=2)

for train_index, val_index in tscv.split(X_train):
    # TimeSeriesSplit yields positional indices, so select rows with .iloc
    X_train_cv, X_val_cv = X_train.iloc[train_index], X_train.iloc[val_index]
    y_train_cv, y_val_cv = y_train.iloc[train_index], y_train.iloc[val_index]

    # one-hot encoding
    vectorizer = CountVectorizer(lowercase=True)
    X_train_cv_ohe = vectorizer.fit_transform(X_train_cv)
    X_val_cv_ohe = vectorizer.transform(X_val_cv)

    # training model
    log_reg = LogisticRegression(random_state=42, verbose=0, max_iter=1000)
    log_reg.fit(X_train_cv_ohe, y_train_cv)

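    # eval model (predict() returns hard 0/1 labels, so the AUC below is
    # computed from labels rather than from predicted probabilities)
    y_pred = log_reg.predict(X_val_cv_ohe)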

    scores.append(roc_auc_score(y_val_cv, y_pred))

In [257]:
add_model_eval('logistic regression (one hot encoding)', scores)

In [258]:
view_models_eval()

model,roc_auc
logistic regression (one hot encoding),0.49


The logistic regression model trained on the one-hot encoded text reaches a ROC AUC score of only 0.49, which is no better than randomly guessing whether the stock market will rise or fall.
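
One caveat about the benchmark: the cell above passes the output of `log_reg.predict`, which is a hard 0/1 label, to `roc_auc_score`. ROC AUC is a ranking metric, so it is usually computed from scores or probabilities (in scikit-learn, for example, `log_reg.predict_proba(X)[:, 1]`). The sketch below uses a hand-written pairwise AUC (the helper `pairwise_auc` is hypothetical) and made-up numbers to show how thresholding can hide ranking information.

```python
def pairwise_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 1, 1]
probabilities = [0.10, 0.40, 0.35, 0.45]  # illustrative model outputs

# Thresholding at 0.5 turns every prediction into class 0.
hard_labels = [1 if p >= 0.5 else 0 for p in probabilities]

auc_from_probabilities = pairwise_auc(y_true, probabilities)
auc_from_labels = pairwise_auc(y_true, hard_labels)

print(auc_from_probabilities, auc_from_labels)
```

In this toy case the probabilities rank three of the four positive/negative pairs correctly (AUC 0.75), while the thresholded labels are all tied (AUC 0.5). Scores computed from labels and scores computed from probabilities are therefore not directly comparable.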

### 11.1.2 Deep Neural Network

In [308]:
scores = []

tscv = TimeSeriesSplit(n_splits=2)

for train_index, val_index in tscv.split(X_train):
    # TimeSeriesSplit yields positional indices, so select rows with .iloc
    X_train_cv, X_val_cv = X_train.iloc[train_index], X_train.iloc[val_index]
    y_train_cv, y_val_cv = y_train.iloc[train_index], y_train.iloc[val_index]

    # one-hot encoding
    vectorizer = CountVectorizer(lowercase=True)
    X_train_cv_ohe = vectorizer.fit_transform(X_train_cv)
    X_val_cv_ohe = vectorizer.transform(X_val_cv)

    # convert into tensor
    X_train_cv_ohe = tf.constant(X_train_cv_ohe.toarray())
    X_val_cv_ohe = tf.constant(X_val_cv_ohe.toarray())
    y_train_cv = tf.constant(y_train_cv)
    y_val_cv = tf.constant(y_val_cv)

    # set the random seed for reproducible results
    tf.random.set_seed(42)

    # construct model
    input_shape = X_train_cv_ohe.shape[1]

    model = Sequential()
    # shape must be a tuple; (input_shape) without the comma is just an int
    model.add(Input(shape=(input_shape,), name='input_layer'))
    model.add(Dense(256, activation='relu', name='hidden_layer_2'))
    model.add(Dense(128, activation='relu', name='hidden_layer_3'))  
    model.add(Dense(1, activation='sigmoid', name='output_layer'))

    # compile model
    model.compile(loss='binary_crossentropy',
                  optimizer='adam',
                  metrics=[AUC()])

    # optimisation
    early_stopping_cb = EarlyStopping(patience=5, restore_best_weights=True)

    # training model
    history = model.fit(X_train_cv_ohe, y_train_cv, epochs=100, 
                        validation_data=(X_val_cv_ohe, y_val_cv),
                        callbacks=[early_stopping_cb])
    

    # eval model
    y_pred = model.predict(X_val_cv_ohe)

    scores.append(roc_auc_score(y_val_cv, y_pred))

Train on 424 samples, validate on 424 samples
Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Train on 848 samples, validate on 424 samples
Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100


In [310]:
add_model_eval('dnn (one hot encoding)', scores)

In [311]:
view_models_eval()

model,roc_auc
logistic regression (one hot encoding),0.49
dnn (one hot encoding),0.48
