<a href="https://colab.research.google.com/github/FrancescoMK/Sentiment-Analysis-for-Stock-Market-Prediction/blob/master/Sentiment_Analysis_for_Price_Movement_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Sentiment Analysis for Price Movement Prediction**


In [None]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

In [None]:
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)
root_dir = "/content/gdrive/My Drive/"
base_dir = root_dir + 'fastai-v3/'

In [None]:
from fastai.text import *
from fastai.callbacks import *

### **Data Manipulation**

**Getting the data**

In [None]:
path = 'data'

*downloading the data*

In [None]:
!wget https://s3.amazonaws.com/fast-ai-nlp/ag_news_csv.tgz -P {path}

*extracting the data*

In [None]:
!tar -zxf {path}/'ag_news_csv.tgz' -C {path}

**Preparing the data**

In [None]:
path = Path('data/ag_news_csv')
path.ls()

*load the .csv files and name their columns*

In [None]:
df_train = pd.read_csv(path/'train.csv',  names=["class", "title", "description"])
df_train.head()

In [None]:
df_test = pd.read_csv(path/'test.csv',  names=["class", "title", "description"])
df_test.head()

*concatenate df_train and df_test into a single dataframe*

In [None]:
# DataFrame.append was removed in pandas 2.0; pd.concat works in old and new versions
df = pd.concat([df_train, df_test], ignore_index=True)

In [None]:
len(df), len(df_train), len(df_test)

In [None]:
df.head()

### **Concatenated Language Model**

**Create data object with the data block API**

*convert words into numbers in steps: tokenization and numericalization*

In [None]:
data_lm = (TextList.from_df(df, path, cols=['title', 'description'])
                .split_by_rand_pct(0.1)
                .label_for_lm()
                .databunch())

*save and load data object*

In [None]:
data_lm.save('data_lm')

In [None]:
data_lm = load_data(path, 'data_lm')

*show data object after tokenization and numericalization*

In [None]:
len(data_lm.train_ds), len(data_lm.valid_ds)

In [None]:
data_lm.show_batch()

*datasets are now represented as tokenized and numericalized text*

In [None]:
data_lm.train_ds[100][0]

*the correspondence from ids to tokens is stored in the vocab attribute of the datasets, in a list called itos*

In [None]:
data_lm.vocab.itos[:10]
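
The two steps can be illustrated without fastai. The sketch below is a deliberately simplified stand-in: it splits on whitespace and uses a hypothetical `build_vocab` helper, whereas fastai's tokenizer is spaCy-based and adds further special tokens. It only shows how `itos` (id to string) relates to its inverse mapping `stoi` (string to id).

```python
from collections import Counter

UNK = "xxunk"  # fastai's token for out-of-vocabulary words


def tokenize(text):
    """Simplified tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def build_vocab(token_lists, min_freq=1):
    """Hypothetical helper: build itos (list) and stoi (dict) from tokenized texts."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    # most_common() orders tokens by descending frequency
    itos = [UNK] + [tok for tok, c in counts.most_common() if c >= min_freq]
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi


def numericalize(tokens, stoi):
    """Map tokens to ids, falling back to the id of UNK for unknown tokens."""
    return [stoi.get(tok, stoi[UNK]) for tok in tokens]


texts = ["Stocks rally on earnings", "Oil prices fall as stocks rally"]
tokenized = [tokenize(t) for t in texts]
itos, stoi = build_vocab(tokenized)

ids = numericalize(tokenize("Stocks fall sharply"), stoi)
decoded = [itos[i] for i in ids]  # ids back to tokens via itos
print(ids, decoded)
```

The word "sharply" never appears in the toy corpus, so it maps to the id of `xxunk` and decodes back to that token rather than to the original word.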

In [None]:
data_lm.batch_size

**Pretrained model on wiki data**


*The pretrained model is selected with arch=AWD_LSTM; because pretrained=True is the default, fastai downloads the pretrained weights automatically*

In [None]:
learn = language_model_learner(data_lm, arch=AWD_LSTM,  drop_mult=0)

In [None]:
learn.lr_find()

In [None]:
learn.recorder.plot(skip_end=15)

*save the best model so far (lowest validation loss), checked at the end of every epoch*

In [None]:
learn.callback_fns.append(partial(SaveModelCallback, name='lm-stage-1'))
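
With its default settings, `SaveModelCallback` overwrites the checkpoint only when the monitored validation loss improves. The bookkeeping amounts to roughly the following sketch; `BestTracker` and the loss values are made up for illustration and are not part of fastai.

```python
class BestTracker:
    """Illustrative stand-in: remembers which epoch had the lowest monitored value."""

    def __init__(self):
        self.best = float("inf")
        self.best_epoch = None

    def on_epoch_end(self, epoch, valid_loss):
        """Return True when a checkpoint would be written for this epoch."""
        if valid_loss < self.best:
            self.best = valid_loss
            self.best_epoch = epoch
            return True
        return False


tracker = BestTracker()
valid_losses = [3.9, 3.6, 3.7]  # made-up validation losses, one per epoch
saved = [tracker.on_epoch_end(e, loss) for e, loss in enumerate(valid_losses)]
print(saved, tracker.best_epoch)
```

In this toy run the checkpoint from epoch 1 is the one a later `learn.load(...)` would restore, even though training ended at epoch 2.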

**Fit model**

In [None]:
learn.fit_one_cycle(3, 3.98E-02)
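
`fit_one_cycle` treats its second argument as the peak learning rate of the cycle. A simplified sketch of the learning-rate schedule follows, assuming fastai v1's defaults (`div_factor=25`, `pct_start=0.3`, a final rate of `lr_max / (div_factor * 1e4)`) and cosine annealing in both phases. The accompanying momentum schedule is left out, and `one_cycle_lrs` is an illustrative name rather than a fastai function.

```python
import math


def annealing_cos(start, end, pct):
    """Cosine interpolation from start (pct=0) to end (pct=1)."""
    cos_out = math.cos(math.pi * pct) + 1
    return end + (start - end) / 2 * cos_out


def one_cycle_lrs(lr_max, n_iter, div_factor=25.0, pct_start=0.3, final_div=None):
    """Learning rate per iteration: warm up to lr_max, then anneal down."""
    final_div = div_factor * 1e4 if final_div is None else final_div
    n_up = int(n_iter * pct_start)
    n_down = n_iter - n_up
    # Phase 1: from lr_max / div_factor up to lr_max
    up = [annealing_cos(lr_max / div_factor, lr_max, i / max(n_up, 1))
          for i in range(n_up)]
    # Phase 2: from lr_max down to lr_max / final_div
    down = [annealing_cos(lr_max, lr_max / final_div, i / max(n_down - 1, 1))
            for i in range(n_down)]
    return up + down


lrs = one_cycle_lrs(3.98e-2, n_iter=100)
print(f"start={lrs[0]:.5f} peak={max(lrs):.5f} end={lrs[-1]:.2e}")
```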

In [None]:
learn.load('lm-stage-1');

*optional (commented out): fit the model for 7 more epochs*

In [None]:
# learn.callback_fns.pop()
# learn.callback_fns.append(partial(SaveModelCallback, name='lm-stage-2'))

In [None]:
# learn.fit_one_cycle(7, 3.98E-02)

In [None]:
# learn.load('lm-stage-2');

**Unfreeze and fine tune all the layers**

In [None]:
learn.unfreeze()

In [None]:
learn.lr_find()

In [None]:
learn.recorder.plot()

*fit the fine tuned model*

In [None]:
# drop the 'lm-stage-1' callback so this run does not overwrite that checkpoint
learn.callback_fns.pop()
learn.callback_fns.append(partial(SaveModelCallback, name='lm-unfreeze-1'))

In [None]:
learn.fit_one_cycle(3, 3e-04)

**Predict and Save**

*generate text with the fine-tuned language model*

In [None]:
learn.load('lm-unfreeze-1');

In [None]:
TEXT = "This year is going to challenge"
N_WORDS = 40
N_SENTENCES = 2

In [None]:
print("\n".join(learn.predict(TEXT, N_WORDS, temperature=0.75) for _ in range(N_SENTENCES)))
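
In effect, the `temperature` argument rescales the model's output scores before the next word is sampled: values below 1 sharpen the distribution toward the most likely words, values above 1 flatten it. A minimal sketch of that step, using made-up scores over a four-word vocabulary (`softmax_with_temperature`, `vocab` and `logits` are illustrative names, not fastai API):

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


vocab = ["market", "economy", "team", "weather"]
logits = [2.0, 1.5, 0.5, 0.1]  # made-up next-word scores

p_default = softmax_with_temperature(logits, 1.0)
p_sharp = softmax_with_temperature(logits, 0.75)

random.seed(0)  # fixed seed so the sampled word is reproducible
next_word = random.choices(vocab, weights=p_sharp, k=1)[0]
print([round(p, 3) for p in p_sharp], next_word)
```

At 0.75 the most likely word receives a larger share of the probability mass than at 1.0, so the generated sentences become more predictable and less varied.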

*save the encoder, the part of the model responsible for creating and updating the hidden state*

In [None]:
learn.save_encoder('fine_tuned_enc')

### **Sentiment Classifier**

**Create a new data object**

In [None]:
path = untar_data(URLs.IMDB)

*only grabs the labelled data and keeps those labels*

In [None]:
bs = 48  # batch size was undefined; 48 is an assumed value, lower it if the GPU runs out of memory

data_clas = (TextList.from_folder(path, vocab=data_lm.vocab)
             #grab all the text files in path
             .split_by_folder(valid='test')
             #split by train and valid folder (that only keeps 'train' and 'test' so no need to filter)
             .label_from_folder(classes=['neg', 'pos'])
             #label them all with their folders
             .databunch(bs=bs))

data_clas.save('data_clas.pkl')

In [None]:
data_clas = load_data(path, 'data_clas.pkl', bs=bs)

In [None]:
data_clas.show_batch()

**Model Creation**

*create a model to classify those reviews and load the encoder*

In [None]:
learn = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
learn.load_encoder('fine_tuned_enc')

In [None]:
learn.lr_find()

In [None]:
learn.recorder.plot()

*first fit*

In [None]:
learn.fit_one_cycle(1, 2e-2, moms=(0.8,0.7))

In [None]:
learn.save('first')

In [None]:
learn.load('first');

*second fit*

In [None]:
learn.freeze_to(-2)
learn.fit_one_cycle(1, slice(1e-2/(2.6**4),1e-2), moms=(0.8,0.7))
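
Passing `slice(lo, hi)` gives each layer group its own learning rate, spaced geometrically from `lo` for the earliest group to `hi` for the last. Assuming the classifier is split into five layer groups (which would explain the divisor `2.6**4`), each group then trains with a rate 2.6 times smaller than the group after it. The helper below reproduces that spacing in plain Python; it is an illustration and does not call fastai.

```python
def even_mults(start, stop, n):
    """n values from start to stop with a constant ratio between neighbours."""
    step = (stop / start) ** (1 / (n - 1))
    return [start * step ** i for i in range(n)]


n_groups = 5  # assumed number of layer groups in the AWD_LSTM classifier
lrs = even_mults(1e-2 / (2.6 ** 4), 1e-2, n_groups)
for group, lr in enumerate(lrs):
    print(f"layer group {group}: lr = {lr:.2e}")
```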

In [None]:
learn.save('second')

In [None]:
learn.load('second');

*third fit*

In [None]:
learn.freeze_to(-3)
learn.fit_one_cycle(1, slice(5e-3/(2.6**4),5e-3), moms=(0.8,0.7))

In [None]:
learn.save('third')

In [None]:
learn.load('third');

**Prediction**

*unfreeze and fit*

In [None]:
learn.unfreeze()
learn.fit_one_cycle(2, slice(1e-3/(2.6**4),1e-3), moms=(0.8,0.7))

In [None]:
learn.save('unfrozen')

In [None]:
learn.load('unfrozen')

*make prediction*

In [None]:
learn.predict("I really loved that movie, it was awesome!")  # no trailing semicolon, so the prediction is displayed

**Export Model for Reuters**

In [None]:
learn.export(file = 'export_clas.pkl')

### **Apply to Stocks (Draft)**

**Get Reuters Data**

In [None]:
import pandas as pd

In [None]:
from google.colab import files
uploaded = files.upload()  # dict mapping each uploaded filename to its bytes

In [None]:
import io

df_predict = pd.read_csv(io.BytesIO(uploaded['Reuters_Cleaned_V2.csv']), names=["ID", "Source", "Text"])
df_predict.head()

**Apply Model to Data**

In [None]:
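# load_learner expects the folder containing the exported file, not a DataFrame;
# learn.export wrote export_clas.pkl to learn.path, assumed here to be the IMDB `path` above
learn = load_learner(path, file = 'export_clas.pkl')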

In [None]:
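# draft: register the Reuters texts as the test set (add_test has to be called with the items)
learn.data.add_test(TextList.from_df(df_predict, cols='Text'))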