# Requirements

In [22]:
# Add as many imports as you need.

# Laboratory Exercise - Run Mode (8 points)

## Introduction
In this laboratory assignment, the primary objective is to use Long Short-Term Memory (LSTM) networks for time series forecasting in order to predict the **close price** of the Dow Jones Industrial Average (DJIA) index for a given day. To accomplish this, use data from the preceding 7 days, which includes both numeric price information and news information. The goal is to use an LSTM, a type of recurrent neural network, to forecast one future step of the index price (the following day).


## The DJIA Dataset

This dataset consists of daily price records for the value of the Dow Jones Industrial Average index. The dataset includes the following attributes:

- Date - date in the format YYYY-MM-DD
- Open - open price of the index on the specified date
- Close - close price of the index on the specified date
- High - high price of the index on the specified date
- Low - low price of the index on the specified date
- Volume - trading volume on the specified date (number of shares traded)



## The Reddit News Dataset

This dataset consists of news headlines for a certain date that might impact the price:

- Date - date in the format YYYY-MM-DD
- News - news headline scraped from Reddit

<b>Note: There may be multiple headlines for each date, and the number of headlines may differ from date to date.</b>

Load each dataset into its own `pandas` data frame.

In [23]:
import pandas as pd

# Write your code here. Add as many boxes as you need.
df_dija = pd.read_csv("./DJIA_table.csv")
df_reddit = pd.read_csv("./RedditNews.csv")

In [24]:
df_dija.sample(3)

,Date,Open,High,Low,Close,Volume
31,5/18/2016,17501.2793,17636.2207,17418.21094,17526.61914,79120000
1918,11/17/2008,8494.839844,8571.299805,8246.889648,8273.580078,278220000
198,9/18/2015,16674.74023,16674.74023,16343.75977,16384.58008,341690000


In [25]:
df_dija["Date"] = pd.to_datetime(df_dija["Date"])
df_dija.set_index(keys=["Date"], inplace=True)
df_dija.sort_index(inplace=True)
df_dija.sample(5)

Date,Open,High,Low,Close,Volume
2014-01-24,16203.29004,16203.29004,15879.11035,15879.11035,141450000
2012-01-09,12359.30957,12409.08008,12333.84961,12392.69043,122200000
2015-12-21,17154.93945,17272.35938,17116.73047,17251.61914,114910000
2008-08-18,11659.65039,11690.42969,11434.12012,11479.38965,156290000
2011-07-28,12301.71973,12384.90039,12226.83008,12240.11035,148710000


In [38]:
df_reddit.sample(3)

,Date,News
31979,2012-12-29,Heirs of Maos Comrades Rise as New Capitalist ...
53192,2010-09-03,NATO attack kills 10 civilian campaign workers...
51143,2010-11-24,Iran's parliament revealed it planned to impea...


In [39]:
df_reddit["Date"] = pd.to_datetime(df_reddit["Date"])
df_reddit.set_index(keys=["Date"], inplace=True)
df_reddit.sort_index(inplace=True)
df_reddit.sample(5)

Date,News
2014-08-14,The Australian government's chief business adv...
2009-01-29,b'Swiss police find massive marijuana farm on ...
2016-01-29,Russian fighter came within 15 feet of U.S. Ai...
2010-02-14,b'Brussels train crash: at least 20 people fea...
2013-08-30,Canada will not join the U.S. and U.K. in a mi...


In [40]:
df = pd.merge(left=df_dija, right=df_reddit, right_index=True, left_index=True)


Merge the datasets. Be careful: a plain merge yields multiple rows per date (one per headline), which is not desirable. The final data frame should contain a single row per date.

In [46]:
df_reddit.shape[0]

73608

In [41]:
# Write your code here. Add as many boxes as you need.
df

Date,Open,High,Low,Close,Volume,News
2008-08-08,11432.08984,11759.95996,11388.04004,11734.32031,212830000,b'Georgian troops retreat from S. Osettain cap...
2008-08-08,11432.08984,11759.95996,11388.04004,11734.32031,212830000,"b""The 'enemy combatent' trials are nothing but..."
2008-08-08,11432.08984,11759.95996,11388.04004,11734.32031,212830000,"b""Breaking: Georgia invades South Ossetia, Rus..."
2008-08-08,11432.08984,11759.95996,11388.04004,11734.32031,212830000,b'150 Russian tanks have entered South Ossetia...
2008-08-08,11432.08984,11759.95996,11388.04004,11734.32031,212830000,b'Did the U.S. Prep Georgia for War with Russia?'
...,...,...,...,...,...,...
2016-07-01,17924.24023,18002.38086,17916.91016,17949.36914,82160000,"Venezuela, where anger over food shortages is ..."
2016-07-01,17924.24023,18002.38086,17916.91016,17949.36914,82160000,A Hindu temple worker has been killed by three...
2016-07-01,17924.24023,18002.38086,17916.91016,17949.36914,82160000,Ozone layer hole seems to be healing - US &amp...
2016-07-01,17924.24023,18002.38086,17916.91016,17949.36914,82160000,Taiwanese warship accidentally fires missile t...


In [47]:
df.columns[:-1]

Index(['Open', 'High', 'Low', 'Close', 'Volume'], dtype='object')

In [48]:
aggregation = {
    col: 'first' for col in df.columns[:-1]
}
aggregation.update({'News': '\n'.join})

In [49]:
aggregation

{'Open': 'first',
 'High': 'first',
 'Low': 'first',
 'Close': 'first',
 'Volume': 'first',
 'News': <function str.join(iterable, /)>}

In [50]:
df = df.groupby('Date').agg(aggregation)
df

Date,Open,High,Low,Close,Volume,News
2008-08-08,11432.08984,11759.95996,11388.04004,11734.32031,212830000,b'Georgian troops retreat from S. Osettain cap...
2008-08-11,11729.66992,11867.11035,11675.53027,11782.34961,183190000,b'Russia angered by Israeli military sale to G...
2008-08-12,11781.70020,11782.34961,11601.51953,11642.46973,173590000,b'U.S. Beats War Drum as Iran Dumps the Dollar...
2008-08-13,11632.80957,11633.78027,11453.33984,11532.95996,182550000,"b""Bush announces Operation Get All Up In Russi..."
2008-08-14,11532.07031,11718.28027,11450.88965,11615.92969,159790000,b'Poland and US agree to missle defense deal. ...
...,...,...,...,...,...,...
2016-06-27,17355.21094,17355.21094,17063.08008,17140.24023,138740000,Angela Merkel said the U.K. must file exit pap...
2016-06-28,17190.50977,17409.72070,17190.50977,17409.72070,112190000,Hong Kong democracy activists call for return ...
2016-06-29,17456.01953,17704.50977,17456.01953,17694.67969,106380000,A chatbot programmed by a British teenager has...
2016-06-30,17712.75977,17930.60938,17711.80078,17929.99023,133030000,US airstrikes kill at least 250 ISIS fighters ...


In [51]:
from transformers import pipeline
classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")

KeyboardInterrupt: 

In [None]:
sequence_to_classify = "one day I will see the world"
candidate_labels = ['travel', 'cooking', 'dancing']
classifier(sequence_to_classify, candidate_labels)

## Feature Extraction


1. DJIA Table
Apply lags of 1 to 7 days to each feature, creating a set of features that represents the index values from the previous 7 days. To maintain dataset integrity, drop the rows with missing values that result at the beginning of the dataset.

2. Reddit News Table
Create a numeric representation for the news (for example average embedding or average sentiment). <b> You must create lags of the news features as well since we will not know the news for the future. </b>

Hint: Use `df['column_name'].shift(period)`. Check the documentation at https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.shift.html.
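
The lagging step can be sketched on a small toy frame. This is only an illustration under assumptions: the `Sentiment` column is a hypothetical numeric news feature, and the `add_lags` helper and the toy values are not part of the lab materials.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the merged data. "Sentiment" is a hypothetical
# numeric news feature, not a column from the original datasets.
dates = pd.bdate_range("2020-01-01", periods=12)
toy = pd.DataFrame(
    {
        "Close": np.arange(100.0, 112.0),
        "Volume": np.arange(1000.0, 1012.0),
        "Sentiment": np.linspace(-1.0, 1.0, 12),
    },
    index=dates,
)


def add_lags(frame, columns, n_lags=7):
    """Return lagged copies of `columns` (lags 1..n_lags) with NaN rows dropped."""
    lagged = {
        f"{col}_lag{lag}": frame[col].shift(lag)
        for col in columns
        for lag in range(1, n_lags + 1)
    }
    return pd.DataFrame(lagged, index=frame.index).dropna()


# Only lagged values are used as inputs, for prices and news alike.
features = add_lags(toy, ["Close", "Volume", "Sentiment"], n_lags=7)

# The same-day close is the value to predict, aligned to the surviving rows.
target = toy.loc[features.index, "Close"]
```

Because the first 7 rows have no complete history, they are dropped, and the target is re-aligned to the remaining index so that inputs and outputs stay in step.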

In [27]:
# Write your code here. Add as many boxes as you need.

## Dataset Splitting
Partition the dataset into training and testing sets with an 80:20 ratio.

**WARNING: DO NOT SHUFFLE THE DATASET.**
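
A chronological split can be done with plain slicing. The sketch below uses made-up arrays `X` and `y` as stand-ins for the lagged features and the target. `train_test_split(..., shuffle=False)` from `scikit-learn` should give an equivalent ordered split.

```python
import numpy as np

# Stand-in data: 10 time-ordered samples with 2 features each.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

split_at = int(len(X) * 0.8)  # the first 80% of the rows go to training

X_train, X_test = X[:split_at], X[split_at:]
y_train, y_test = y[:split_at], y[split_at:]
```

Every training sample precedes every test sample, so the model is never evaluated on data older than what it was trained on.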



In [28]:
# Write your code here. Add as many boxes as you need.

## Feature Scaling
Scale the extracted features using an appropriate scaler if needed.
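
One possible approach, assuming `MinMaxScaler` is an acceptable choice, is to fit the scaler on the training rows only and reuse it for the test rows. The arrays below are invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Invented stand-in arrays. Real inputs would be the lagged features and target.
X_train = np.array([[10.0, 1.0], [20.0, 3.0], [30.0, 5.0]])
X_test = np.array([[40.0, 2.0]])
y_train = np.array([[15.0], [25.0], [35.0]])

x_scaler = MinMaxScaler()
y_scaler = MinMaxScaler()

# Scaling statistics are learned from the training rows only.
X_train_scaled = x_scaler.fit_transform(X_train)
# Test rows reuse those statistics, so values may fall outside [0, 1].
X_test_scaled = x_scaler.transform(X_test)
y_train_scaled = y_scaler.fit_transform(y_train)

# Values on the scaled target can be mapped back to price units.
restored = y_scaler.inverse_transform(y_train_scaled)
```

Keeping a separate scaler for the target makes it straightforward to convert predictions back to index points before computing metrics or plotting.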

In [29]:
# Write your code here. Add as many boxes as you need.

## Feature Reshaping

Reshape the feature dimensions into the shape `(samples, timesteps, features)`.
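
How to reshape depends on how the lag columns are ordered. The sketch below assumes the columns are grouped by feature (`f0_lag1..f0_lag7, f1_lag1..f1_lag7, ...`). The dimensions and the array are illustrative only.

```python
import numpy as np

n_samples, n_features, n_lags = 4, 3, 7

# Flat lag matrix with columns grouped by feature:
# [f0_lag1..f0_lag7, f1_lag1..f1_lag7, f2_lag1..f2_lag7]
flat = np.arange(n_samples * n_features * n_lags, dtype=float).reshape(
    n_samples, n_features * n_lags
)

# Split the columns into (feature, lag), then swap axes so lags become timesteps.
X = flat.reshape(n_samples, n_features, n_lags).transpose(0, 2, 1)

# Reverse the lag axis so timesteps run oldest -> newest (lag 7 first, lag 1 last).
X = X[:, ::-1, :]
```

A direct `reshape(n_samples, n_lags, n_features)` would silently mix features and timesteps for this column ordering, which is why the intermediate transpose is used here.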

In [30]:
# Write your code here. Add as many boxes as you need.

## Long Short-Term Memory (LSTM) Network


Define the forecasting model using the **Keras Sequential API** (`keras.models.Sequential`), incorporating one or more LSTM layers along with additional relevant layers (`keras.layers`). Be cautious when specifying the configuration of the final layer to ensure proper model output for the forecasting task.
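
A minimal sketch of such a model is shown here. The layer sizes, the dropout rate and the input dimensions are placeholder assumptions rather than values prescribed by the lab.

```python
import keras

# Placeholder dimensions: 7 lagged timesteps, 6 features per timestep.
n_timesteps, n_features = 7, 6

model = keras.models.Sequential(
    [
        keras.Input(shape=(n_timesteps, n_features)),
        keras.layers.LSTM(32),      # returns only the last hidden state
        keras.layers.Dropout(0.2),  # light regularisation (an arbitrary choice)
        keras.layers.Dense(1),      # one linear unit: a single next-day value
    ]
)
```

The final layer has one unit and no activation because the task is a one-step regression. A softmax or sigmoid output would not be appropriate for an unbounded price.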

In [31]:
# Write your code here. Add as many boxes as you need.

Compile the previously defined model, specifying the **loss function** (`keras.losses`), **optimizer** (`keras.optimizers`) and **evaluation metrics** (`keras.metrics`).

In [32]:
# Write your code here. Add as many boxes as you need.

Train the model on the training set, specifying the **batch size** and **number of epochs** for the training process. Allocate 20% of the samples for **validation**, and ensure that the dataset remains **unshuffled** during training.
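
The sketch below compiles and trains a deliberately tiny model on synthetic data to show the relevant `fit` arguments. The data, the layer sizes, the batch size and the epoch count are all placeholders.

```python
import numpy as np
import keras

# Synthetic stand-in data: 50 samples, 7 timesteps, 3 features.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(50, 7, 3)).astype("float32")
y_train = rng.normal(size=(50, 1)).astype("float32")

model = keras.models.Sequential(
    [keras.Input(shape=(7, 3)), keras.layers.LSTM(8), keras.layers.Dense(1)]
)
model.compile(
    loss=keras.losses.MeanSquaredError(),
    optimizer=keras.optimizers.Adam(learning_rate=1e-3),
    metrics=[keras.metrics.MeanAbsoluteError()],
)

history = model.fit(
    X_train,
    y_train,
    batch_size=16,
    epochs=2,
    validation_split=0.2,  # Keras holds out the last 20% of the samples
    shuffle=False,         # keep the training batches in chronological order
    verbose=0,
)
```

`history.history` then holds per-epoch lists under keys such as `loss` and `val_loss`, which can be passed directly to a plotting function.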

In [33]:
# Write your code here. Add as many boxes as you need.

Create a line plot illustrating both the **training** and **validation loss** over the training epochs.

In [34]:
# Write your code here. Add as many boxes as you need.

Use the trained model to make predictions for the test set.

In [35]:
# Write your code here. Add as many boxes as you need.

Assess the performance of the model by using different metrics provided by the `scikit-learn` library.
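
As an illustration, common regression metrics can be computed as follows. The two arrays are invented values standing in for the actual and predicted close prices, which should be in original price units.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Invented values standing in for actual and predicted close prices.
y_true = np.array([100.0, 102.0, 101.0, 105.0])
y_pred = np.array([101.0, 101.0, 103.0, 104.0])

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = float(np.sqrt(mse))  # same units as the price
r2 = r2_score(y_true, y_pred)

print(f"MAE={mae:.2f}  MSE={mse:.2f}  RMSE={rmse:.2f}  R2={r2:.2f}")
```

If the target was scaled before training, both arrays should be inverse-transformed first, otherwise MAE and RMSE are not expressed in index points.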

In [36]:
# Write your code here. Add as many boxes as you need.

Create a line plot to compare the actual and predicted close prices for the test set.

In [37]:
# Write your code here. Add as many boxes as you need.