In [3]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

from tqdm import tqdm


Let's see if we can join the daily headlines to price movements in the stock market, to get a sense of how the daily news affects sentiment on the markets as a whole. We could then make a guess at what will happen tomorrow based on today's events and tomorrow's likely headlines.

I'll work with a subset on here so that I can explore the data quickly and make decisions, before removing all the sub-setting and running it for real on a more powerful platform.

In [5]:
# Use the absolute /kaggle/input path, matching the other read_csv calls in this notebook
news = pd.read_csv("/kaggle/input/million-headlines/abcnews-date-text.csv")#, nrows=100000)

# publish_date is stored as an integer such as 20030219, so parse it with an explicit format
news["publish_date"] = pd.to_datetime(news["publish_date"].astype(str), format="%Y%m%d")

news.head()

Unnamed: 0,publish_date,headline_text
0,2003-02-19,aba decides against community broadcasting lic...
1,2003-02-19,act fire witnesses must be aware of defamation
2,2003-02-19,a g calls for infrastructure protection summit
3,2003-02-19,air nz staff in aust strike for pay rise
4,2003-02-19,air nz strike to affect australian travellers


In [7]:
import string
from nltk.corpus import stopwords

# A set gives fast membership checks when filtering every word of every headline
STOPWORDS = set(stopwords.words('english') + ['u', 'ü', 'ur', '4', '2', 'im', 'dont', 'doin', 'ure'])

    
def text_process(mess):
    """
    Takes in a string of text, then performs the following:
    1. Remove all punctuation
    2. Remove all stopwords
    3. Returns the cleaned text as a single space-joined string
    """
    
    # Check characters to see if they are in punctuation
    nopunc = [char for char in mess if char not in string.punctuation]
    
    # Join the characters again to form the string.
    nopunc = ''.join(nopunc)
    # Now just remove any stopwords
    return ' '.join([word for word in nopunc.split() if word.lower() not in STOPWORDS])

Remove the stopwords and punctuation from the headlines

In [4]:
news["headline_text"] = news.headline_text.apply(text_process)

news.head()

Unnamed: 0,publish_date,headline_text
0,2003-02-19,aba decides community broadcasting licence
1,2003-02-19,act fire witnesses must aware defamation
2,2003-02-19,g calls infrastructure protection summit
3,2003-02-19,air nz staff aust strike pay rise
4,2003-02-19,air nz strike affect australian travellers
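
As a quick sanity check of the cleaning step, the sketch below applies the same logic to one raw headline. It is only an illustration: `DEMO_STOPWORDS` is a tiny hand-written stand-in for the NLTK list (an assumption, so the example runs without downloading the corpus), and `clean_headline` and `PUNCT_TABLE` are hypothetical names that use `str.translate` rather than the character-by-character filter above.

```python
import string

# Assumption: a tiny stand-in for the NLTK English stopword list.
DEMO_STOPWORDS = {"a", "against", "of", "be", "for", "in", "to"}

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def clean_headline(text: str) -> str:
    """Strip punctuation, drop stopwords, return a space-joined string."""
    no_punct = text.translate(PUNCT_TABLE)
    return " ".join(w for w in no_punct.split() if w.lower() not in DEMO_STOPWORDS)

cleaned = clean_headline("air nz staff in aust strike for pay rise")
print(cleaned)  # air nz staff aust strike pay rise
```

The result matches row 3 of the cleaned table above, which suggests the two approaches behave the same way on plain lower-case headlines.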


Group the headlines by date so that each row covers a single date. We can then get a picture of what kind of day it was: good, bad, terrible...

In [5]:
news = news.groupby("publish_date")["headline_text"].agg(' '.join).to_frame()



In [6]:
news.index.names = ["date"]
news.head()

Unnamed: 0_level_0,headline_text
date,Unnamed: 1_level_1
2003-02-19,aba decides community broadcasting licence act...
2003-02-20,15 dead rebel bombing raid philippines army ab...
2003-02-21,accc timid petrol price investigations action ...
2003-02-22,86 confirmed dead us nightclub fire act touris...
2003-02-23,accused people smuggler face darwin court act ...
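
To make the grouping step concrete, here is a minimal sketch on three made-up headlines. The frame `toy_news` and its contents are illustrative assumptions; the `groupby(...).agg(' '.join)` call is the same one used above.

```python
import pandas as pd

# Toy data: three headlines across two dates (values are illustrative).
toy_news = pd.DataFrame({
    "publish_date": pd.to_datetime(["2003-02-19", "2003-02-19", "2003-02-20"]),
    "headline_text": ["air nz strike", "act fire witnesses", "rebel bombing raid"],
})

# One row per date; all of that day's headlines joined into one string.
daily = toy_news.groupby("publish_date")["headline_text"].agg(" ".join).to_frame()
daily.index.name = "date"
print(daily)
```

Each date ends up as one long "document", which is the unit the vectorizer works on later.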


In [7]:
news.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 6882 entries, 2003-02-19 to 2021-12-31
Data columns (total 1 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   headline_text  6882 non-null   object
dtypes: object(1)
memory usage: 107.5+ KB


Let's bring in the NASDAQ data... should be interesting.

I think we can ignore most of the data though: as we're looking at a broad-brush picture, we only need the open price and the close price. We can then calculate a daily change.

Let's look at one stock symbol and then write a function around that import.

In [8]:
stock_data = pd.read_csv("/kaggle/input/nasdaq-daily-stock-prices/LICN.csv",index_col='date')
stock_data.head()

Unnamed: 0_level_0,ticker,open,high,low,close
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2023-02-06,LICN,3.9,4.25,3.08,3.5
2023-02-07,LICN,3.22,3.55,2.82,3.45
2023-02-08,LICN,3.5,3.6499,3.2701,3.4201
2023-02-09,LICN,3.55,3.58,3.02,3.53
2023-02-10,LICN,3.39,4.1,3.38,4.1


Calculate the daily change, and name it after the stock ticker. This should allow us to easily make a multi-column dataframe

In [9]:
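# The index holds date strings, so use .iloc for positional access
ticker = stock_data["ticker"].iloc[0]

# Fractional change from open to close, stored under the ticker's name
stock_data[ticker] = (stock_data["close"] - stock_data["open"])/stock_data["open"]

# "date" is the index, not a column, so selecting the ticker column is enough
trimmed_stock = stock_data[[ticker]]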

trimmed_stock.head()

Unnamed: 0_level_0,LICN
date,Unnamed: 1_level_1
2023-02-06,-0.102564
2023-02-07,0.071429
2023-02-08,-0.022829
2023-02-09,-0.005634
2023-02-10,0.20944
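
The daily change here is the fractional move from open to close, (close - open) / open. As a small worked check, the sketch below recomputes it for the first two LICN rows shown above; the frame name `prices` is just an illustrative choice.

```python
import pandas as pd

# First two LICN rows from the table above.
prices = pd.DataFrame(
    {"open": [3.9, 3.22], "close": [3.5, 3.45]},
    index=pd.Index(["2023-02-06", "2023-02-07"], name="date"),
)

# Fractional intraday change: (close - open) / open.
change = (prices["close"] - prices["open"]) / prices["open"]
print(change.round(6))
```

Note that this is an intraday measure; it ignores any gap between one day's close and the next day's open.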


Let's bring in another set of values with a similar set of commands and try to merge the two DataFrames.

In [10]:
stock_data_b = pd.read_csv("/kaggle/input/nasdaq-daily-stock-prices/ACRV.csv",index_col='date')
stock_data_b.head()


Unnamed: 0_level_0,ticker,open,high,low,close
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2022-11-15,ACRV,13.35,17.09,12.7083,16.64
2022-11-16,ACRV,15.9,20.7031,15.06,16.75
2022-11-17,ACRV,16.53,19.5,14.88,15.72
2022-11-18,ACRV,15.72,18.95,15.59,15.78
2022-11-21,ACRV,15.77,16.8489,12.5,12.51


In [11]:
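# Same steps as before: positional access via .iloc, then the open-to-close change
ticker = stock_data_b["ticker"].iloc[0]
stock_data_b[ticker] = (stock_data_b["close"] - stock_data_b["open"])/stock_data_b["open"]


# "date" is the index, not a column, so selecting the ticker column is enough
trimmed_stock_b = stock_data_b[[ticker]]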

trimmed_stock_b.head()

Unnamed: 0_level_0,ACRV
date,Unnamed: 1_level_1
2022-11-15,0.246442
2022-11-16,0.053459
2022-11-17,-0.049002
2022-11-18,0.003817
2022-11-21,-0.206722


Let's merge these somehow!

In [12]:
trimmed_stock.info()

<class 'pandas.core.frame.DataFrame'>
Index: 37 entries, 2023-02-06 to 2023-03-29
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   LICN    37 non-null     float64
dtypes: float64(1)
memory usage: 592.0+ bytes


In [13]:
trimmed_stock_b.info()

<class 'pandas.core.frame.DataFrame'>
Index: 92 entries, 2022-11-15 to 2023-03-29
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   ACRV    92 non-null     float64
dtypes: float64(1)
memory usage: 1.4+ KB


This should be an outer join, as we want data from as many days as possible, even if some stocks didn't exist on those days.
An outer join keeps the union of the dates. Since LICN's 37 dates all fall inside ACRV's 92 dates in our test data, there should be 92 rows in our merged data.

In [14]:
merged_stock = trimmed_stock.merge(trimmed_stock_b,how="outer", on="date")

merged_stock.head()

Unnamed: 0_level_0,LICN,ACRV
date,Unnamed: 1_level_1,Unnamed: 2_level_1
2023-02-06,-0.102564,0.039
2023-02-07,0.071429,0.059359
2023-02-08,-0.022829,0.008949
2023-02-09,-0.005634,-0.009362
2023-02-10,0.20944,-0.001818


In [15]:
merged_stock.info()

<class 'pandas.core.frame.DataFrame'>
Index: 92 entries, 2023-02-06 to 2023-02-03
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   LICN    37 non-null     float64
 1   ACRV    92 non-null     float64
dtypes: float64(2)
memory usage: 2.2+ KB


In [16]:
merged_stock.tail()

Unnamed: 0_level_0,LICN,ACRV
date,Unnamed: 1_level_1,Unnamed: 2_level_1
2023-01-30,,0.002083
2023-01-31,,0.084337
2023-02-01,,-0.132275
2023-02-02,,0.097411
2023-02-03,,0.070225
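
The row count and the NaN pattern above can be reproduced on a toy example. This is a sketch with made-up tickers (`AAA`, `BBB`) and values; it also adds a `sort_index()` call, which the cell above does not use, to put the dates back in chronological order.

```python
import pandas as pd

# Toy frames: the shorter series covers a subset of the longer one's dates.
short = pd.DataFrame({"AAA": [0.1, 0.2]},
                     index=pd.Index(["2023-02-06", "2023-02-07"], name="date"))
long_ = pd.DataFrame({"BBB": [0.3, 0.4, 0.5]},
                     index=pd.Index(["2023-02-03", "2023-02-06", "2023-02-07"], name="date"))

# Outer join on the index keeps the union of dates; sort for chronological order.
merged = short.merge(long_, how="outer", on="date").sort_index()
print(merged)
```

Dates that exist only in the longer series get NaN in the shorter series' column, just as LICN is NaN before 2023-02-06 in the output above.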


OK, that seems to have worked. So now we put that into a function and a loop!

Add a limiter to make sure we're not here all day; remove the limiter in the real code.

In [17]:
def import_data(fileName : str, targetFrame : pd.DataFrame) -> pd.DataFrame:
    """Load one ticker's CSV and outer-join its daily change onto targetFrame."""
    stock_data = pd.read_csv(fileName,index_col='date')
    ticker = stock_data["ticker"].iloc[0]
    stock_data[ticker] = (stock_data["close"] - stock_data["open"])/stock_data["open"]
    
    # "date" is the index, so keep only the ticker's change column
    trimmed_stock = stock_data[[ticker]]
    if targetFrame.size > 0:
        return targetFrame.merge(trimmed_stock,how="outer", on="date")
    else:
        return trimmed_stock
    


In [18]:
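counter = 0
limit = 6  # temporary cap on the number of files read; remove for the full run
bigStockMerge = pd.DataFrame()
for dirname, _, filenames in os.walk('/kaggle/input/nasdaq-daily-stock-prices/'):
    for filename in filenames:
        counter += 1
        thisFile = os.path.join(dirname, filename)
        bigStockMerge = import_data(thisFile, bigStockMerge)
        if counter >= limit:
            break
    # the inner break only leaves the file loop, so stop the directory walk too
    if counter >= limit:
        break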
            
            
bigStockMerge.head()

Unnamed: 0_level_0,LICN,LAB,ISRL,AMSF,DFLI,ACRV
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
2023-02-06,-0.102564,-0.014634,,0.012814,0.007342,0.039
2023-02-07,0.071429,0.025,,0.003036,-0.001447,0.059359
2023-02-08,-0.022829,0.029851,,-0.020054,-0.033473,0.008949
2023-02-09,-0.005634,-0.019048,,-0.032676,-0.05303,-0.009362
2023-02-10,0.20944,-0.004854,,0.013787,-0.028256,-0.001818
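
Merging one file at a time gets slower as the frame grows, so for the full run a single `pd.concat` may be worth trying. The sketch below is an alternative under stated assumptions, not the notebook's method: it writes three tiny made-up CSVs to a temporary folder standing in for the Kaggle input directory, and `load_change` is a hypothetical helper. `itertools.islice` plays the role of the limiter.

```python
import itertools
import tempfile
from pathlib import Path

import pandas as pd

def load_change(path: Path) -> pd.Series:
    """Read one ticker CSV and return its (close - open) / open series."""
    df = pd.read_csv(path, index_col="date")
    change = (df["close"] - df["open"]) / df["open"]
    return change.rename(df["ticker"].iloc[0])

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical miniature dataset standing in for the Kaggle input folder.
    for ticker, dates in {"AAA": ["2023-02-06", "2023-02-07"],
                          "BBB": ["2023-02-07"],
                          "CCC": ["2023-02-06"]}.items():
        pd.DataFrame({"date": dates, "ticker": ticker,
                      "open": 10.0, "close": 11.0}).to_csv(Path(tmp) / f"{ticker}.csv", index=False)

    # islice caps the number of files read; drop it to process everything.
    files = itertools.islice(sorted(Path(tmp).glob("*.csv")), 2)
    combined = pd.concat([load_change(f) for f in files], axis=1).sort_index()

print(combined)
```

Concatenating along `axis=1` aligns the series on their date index and fills the gaps with NaN, which gives the same shape of result as the repeated outer merge.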


Now we need the average movement across all tickers for each day.

In [19]:
bigStockMerge["Mean"] = bigStockMerge.mean(axis=1)
bigStockMerge.head(10)

Unnamed: 0_level_0,LICN,LAB,ISRL,AMSF,DFLI,ACRV,Mean
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
2023-02-06,-0.102564,-0.014634,,0.012814,0.007342,0.039,-0.011609
2023-02-07,0.071429,0.025,,0.003036,-0.001447,0.059359,0.031475
2023-02-08,-0.022829,0.029851,,-0.020054,-0.033473,0.008949,-0.007511
2023-02-09,-0.005634,-0.019048,,-0.032676,-0.05303,-0.009362,-0.02395
2023-02-10,0.20944,-0.004854,,0.013787,-0.028256,-0.001818,0.03766
2023-02-13,0.115294,0.034483,,0.008012,-0.146341,0.072348,0.016759
2023-02-14,-0.486792,0.0,,-0.011345,-0.027143,0.0131,-0.102436
2023-02-15,0.071698,-0.036697,,0.027358,-0.062774,-0.025974,-0.005278
2023-02-16,0.001149,0.019417,,0.005009,-0.052308,0.040253,0.002704
2023-02-17,0.02,0.004808,,0.019982,-0.026622,-0.042572,-0.004881
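
Note that ISRL is NaN in every row above, yet the mean is still defined. That is because `DataFrame.mean` skips missing values by default, so each day's average only covers the tickers that traded that day. A minimal sketch with made-up numbers (the `n_tickers` column is an illustrative addition, not part of the notebook):

```python
import numpy as np
import pandas as pd

moves = pd.DataFrame({"AAA": [0.02, np.nan], "BBB": [0.04, 0.06]},
                     index=pd.Index(["2023-02-06", "2023-02-07"], name="date"))

# skipna=True is the default: each row averages only the tickers with data.
moves["Mean"] = moves.mean(axis=1)
# Count how many tickers actually contributed to each row's mean.
moves["n_tickers"] = moves[["AAA", "BBB"]].notna().sum(axis=1)
print(moves)
```

Keeping a count like this alongside the mean makes it easier to spot days where the "market" average is really a single stock.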


Ok, so we have headlines, and movements. Now we want to join the two so we can try to model what influence the headlines have on the movements.

This time though, it's an inner join, as there's no point knowing what the move was on a day when we don't know what the headlines were or vice-versa.

In [20]:
bigStockMerge.index = pd.to_datetime(bigStockMerge.index)

bigStockMerge.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 4368 entries, 2023-02-06 to 2011-02-14
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   LICN    37 non-null     float64
 1   LAB     3049 non-null   float64
 2   ISRL    22 non-null     float64
 3   AMSF    4367 non-null   float64
 4   DFLI    334 non-null    float64
 5   ACRV    92 non-null     float64
 6   Mean    4368 non-null   float64
dtypes: float64(7)
memory usage: 273.0 KB


In [21]:
final_data = news.merge(bigStockMerge,how="inner",on="date")
final_data.head()

Unnamed: 0_level_0,headline_text,LICN,LAB,ISRL,AMSF,DFLI,ACRV,Mean
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
2005-11-17,650m pulp mill plan south east sa abc grandsta...,,,,0.0,,,0.0
2005-11-18,abalone diver released good behaviour bond abb...,,,,-0.002142,,,-0.002142
2005-11-21,act bushfire victims suffer post traumatic str...,,,,0.024203,,,0.024203
2005-11-22,35000 die qld bird flu worst case scenario acc...,,,,0.030303,,,0.030303
2005-11-23,13m go vline trains revamp 23m power station n...,,,,0.001068,,,0.001068


If we look at only a small subset of the news, there may be no overlap with the stock dates, so we go back and remove the import limitations.

In [22]:
final_data = news.merge(bigStockMerge,how="inner",on="date")
# Keep only the text and the target; "date" is already the index
final_data = final_data[["headline_text","Mean"]]
final_data.head()

Unnamed: 0_level_0,headline_text,Mean
date,Unnamed: 1_level_1,Unnamed: 2_level_1
2005-11-17,650m pulp mill plan south east sa abc grandsta...,0.0
2005-11-18,abalone diver released good behaviour bond abb...,-0.002142
2005-11-21,act bushfire victims suffer post traumatic str...,0.024203
2005-11-22,35000 die qld bird flu worst case scenario acc...,0.030303
2005-11-23,13m go vline trains revamp 23m power station n...,0.001068
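
The inner join only works because both indexes were made datetime first; the stock dates are read from CSV as strings. The sketch below illustrates this on made-up rows (the frames `headlines` and `stock` are hypothetical): a headline day with no trading data simply drops out.

```python
import pandas as pd

headlines = pd.DataFrame(
    {"headline_text": ["mill plan", "diver released", "weekend news"]},
    index=pd.DatetimeIndex(["2005-11-17", "2005-11-18", "2005-11-19"], name="date"),
)
# Stock CSV dates arrive as strings, so convert them before joining.
stock = pd.DataFrame({"Mean": [0.0, -0.002142]},
                     index=pd.Index(["2005-11-17", "2005-11-18"], name="date"))
stock.index = pd.to_datetime(stock.index)

# Inner join: only dates present in both frames survive.
joined = headlines.merge(stock, how="inner", on="date")
print(joined)
```

Here 2005-11-19 is a Saturday, so it has headlines but no price data and is excluded from the result.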


Vectorize the news headlines:

In [23]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# instantiate the vectorizer
vect = CountVectorizer()

# fit and transform in one step (a separate fit() call beforehand is redundant)
news_vector = vect.fit_transform(final_data["headline_text"])

# NOTE: the TF-IDF matrix below is only displayed, not stored, so the
# following cells train on the raw counts in news_vector
tfidf_transformer = TfidfTransformer()
tfidf_transformer.fit(news_vector)
tfidf_transformer.transform(news_vector)


<4053x94214 sparse matrix of type '<class 'numpy.float64'>'
	with 3496556 stored elements in Compressed Sparse Row format>
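
If the TF-IDF weights are meant to feed the model, the transformed matrix has to be kept in a variable. The sketch below shows two ways to do that on three made-up documents: the two-step route used in this notebook, and scikit-learn's `TfidfVectorizer`, which combines both steps. The variable names and the `docs` list are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer

docs = ["air nz strike pay rise", "air nz strike affect travellers", "fire witnesses defamation"]

# Two-step route: raw counts, then TF-IDF weights (result kept in a variable).
counts = CountVectorizer().fit_transform(docs)
tfidf_two_step = TfidfTransformer().fit_transform(counts)

# One-step route with the same default settings.
tfidf_one_step = TfidfVectorizer().fit_transform(docs)

print(tfidf_two_step.shape)
```

With default settings the two routes produce the same matrix, so either could replace `news_vector` as the model input.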

In [24]:
from sklearn.model_selection import train_test_split

# define the regression target: the mean daily movement
movements = final_data.Mean


#split out test and train subsets:
X_train, X_test, Y_train, Y_test = train_test_split(news_vector, movements, random_state=1)

The movement is a continuous amount, assumed to be some function $f(\cdot): R^m \rightarrow R^o$ of the input text. We'll therefore try using an MLP, as described here:


https://scikit-learn.org/stable/modules/neural_networks_supervised.html#neural-networks-supervised

In [25]:
print(X_train.shape)
print(Y_train.shape)

(3039, 94214)
(3039,)


In [26]:
from sklearn.neural_network import MLPRegressor

# use two small hidden layers (100 and 2 units) and only 10 iterations for testing
clf = MLPRegressor(solver='lbfgs', alpha=1e-5,hidden_layer_sizes=(100,2), random_state=1,verbose=True, max_iter=10)

clf = clf.fit(X_train,Y_train)

RUNNING THE L-BFGS-B CODE

           * * *

Machine precision = 2.220D-16
 N =      9421705     M =           10

At X0         0 variables are exactly at the bounds


 This problem is unconstrained.



At iterate    0    f=  5.25035D-01    |proj g|=  1.02397D+00

At iterate    1    f=  2.19636D-01    |proj g|=  6.62328D-01

At iterate    2    f=  7.64897D-02    |proj g|=  3.90362D-01

At iterate    3    f=  1.78837D-03    |proj g|=  6.24784D-02

At iterate    4    f=  1.48104D-03    |proj g|=  1.59827D-02

At iterate    5    f=  1.39588D-03    |proj g|=  1.21578D-02

At iterate    6    f=  8.28524D-04    |proj g|=  4.10217D-02

At iterate    7    f=  3.93658D-04    |proj g|=  3.12000D-02

At iterate    8    f=  3.05254D-04    |proj g|=  2.66094D-03

At iterate    9    f=  2.98277D-04    |proj g|=  6.85052D-03

At iterate   10    f=  2.96104D-04    |proj g|=  1.99260D-04

           * * *

Tit   = total number of iterations
Tnf   = total number of function evaluations
Tnint = total number of segments explored during Cauchy searches
Skip  = number of BFGS updates skipped
Nact  = number of active bounds at final generalized Cauchy point
Projg = norm of the final projected gradient
F   

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)


So, we have a model. It's not going to be great, but this is a notebook... What predictions will it make?

In [28]:
from sklearn.metrics import r2_score

Y_Predict = clf.predict(X_test)

score = r2_score(Y_test,Y_Predict)

# Note: r2_score returns the coefficient of determination (R^2), not a classification accuracy
print("The accuracy of our model is {}%".format(round(score, 2) *100))


The accuracy of our model is -1.0%


Atrocious.

In [29]:
from sklearn.metrics import mean_absolute_error
score = mean_absolute_error(Y_test,Y_Predict)
print("The Mean Absolute Error of our Model is {}".format(round(score, 2)))

The Mean Absolute Error of our Model is 0.02


So the mean absolute error is small, but the R² score of about -0.01 shows the model does no better than always predicting the average movement. If we move this all to code, then it might improve over time.
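
To see why a small MAE and a poor R² can coexist, consider a baseline that always predicts the mean of the targets. The sketch below uses five made-up daily moves (an assumption, chosen to be similar in scale to the `Mean` column) and a hand-written `r2` helper following the standard definition 1 - SS_res / SS_tot.

```python
import numpy as np

# Illustrative targets: small daily moves, similar in scale to the "Mean" column.
y_true = np.array([0.01, -0.02, 0.03, -0.01, 0.00])

def r2(y, y_hat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Baseline that always predicts the mean of the targets.
baseline = np.full_like(y_true, y_true.mean())
baseline_r2 = r2(y_true, baseline)
baseline_mae = np.mean(np.abs(y_true - baseline))
print(baseline_r2, baseline_mae)
```

The constant baseline scores an R² of zero while its MAE is still only about 0.014, simply because the targets themselves are small. An R² slightly below zero therefore means the model is marginally worse than that constant guess, whatever the MAE looks like.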