# <center> News classification with ULMFiT. Starter

Here we mostly follow the training scheme described by Jeremy Howard in [fast.ai Lesson 4](https://course.fast.ai/videos/?lesson=4): take a pretrained language model, fine-tune it on unlabeled data, then fine-tune a classification head for our particular task.

This is just a starter. At each step, I also mention how you can do better.

In [None]:
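from tqdm import tqdm_notebook
import numpy as np   # also re-exported by fastai.text; imported explicitly for clarity
import pandas as pd  # same: used below for reading and writing CSV files
import torch
import fastai
from fastai.text import *
fastai.__version__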

# Preprocessing
Here we write all news texts from the train, validation and test files into `unlabeled_news.csv` to train a language model.

Then, we write texts and labels into `train_28k.csv` and texts only into `test_5k.csv`.

**How to do better:** go for that 80k unlabeled set as well.

In [None]:
train = pd.read_csv('../input/train.csv').fillna(' ')
valid = pd.read_csv('../input/valid.csv').fillna(' ')
test = pd.read_csv('../input/test.csv').fillna(' ')

In [None]:
pd.concat([train['text'], valid['text'], test['text']]).to_csv('unlabeled_news.csv', index=None, header=True)

In [None]:
pd.concat([train[['text', 'label']],valid[['text', 'label']]]).to_csv('train_28k.csv', index=None, header=True)
test[['text']].to_csv('test_5k.csv', index=None, header=True)

In [None]:
folder = '.'
unlabeled_file = 'unlabeled_news.csv'

# Reading unlabeled data to train ULMFiT language model

In [None]:
%%time
data_lm = TextLMDataBunch.from_csv(folder, unlabeled_file, text_cols='text')

# LM training 

Here we resort to the training scheme described by Jeremy Howard, [fast.ai](https://course.fast.ai/):
 - finding good initial learning rate
 - training for one epoch
 - unfreezing and more training

**How to do better:** train for 10-15 epochs after unfreezing
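
To make the first step less of a black box: as far as I can tell, the `suggestion=True` option of `recorder.plot` picks the learning rate at which the recorded loss drops fastest. Below is a minimal sketch of that idea in plain Python. The helper `steepest_descent_lr` and the toy numbers are mine (hypothetical, not part of fastai), and the sketch only looks at interior points with a central difference, so treat it as an illustration rather than the library's exact computation.

```python
def steepest_descent_lr(lrs, losses):
    """Return the learning rate at which the loss curve falls fastest.

    Hypothetical helper for illustration only (not a fastai function).
    The slope at each interior point is a central difference over the
    neighbouring recorded losses.
    """
    if len(lrs) != len(losses) or len(lrs) < 3:
        raise ValueError("need two equally long sequences with at least 3 points")
    best_idx, best_slope = None, float("inf")
    for i in range(1, len(losses) - 1):
        slope = (losses[i + 1] - losses[i - 1]) / 2
        if slope < best_slope:  # most negative slope = fastest decrease
            best_idx, best_slope = i, slope
    return lrs[best_idx]


# Toy LR-finder trace: the loss is flat, then drops, then blows up.
toy_lrs = [1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1.0]
toy_losses = [4.0, 3.98, 3.9, 3.2, 2.5, 2.9, 6.0]
suggested = steepest_descent_lr(toy_lrs, toy_losses)
print(suggested)
```

Note that the suggested rate sits before the minimum of the loss curve: the point of fastest improvement is usually a safer choice than the point right before divergence.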

In [None]:
%%time
learn = language_model_learner(data_lm, drop_mult=0.3, arch=AWD_LSTM)

In [None]:
%%time
learn.lr_find(start_lr=slice(1e-6, 1e-4), end_lr=slice(0.1, 10))

In [None]:
learn.recorder.plot(skip_end=10, suggestion=True)

In [None]:
best_lm_lr = learn.recorder.min_grad_lr
best_lm_lr

In [None]:
%%time
learn.fit_one_cycle(1, best_lm_lr)
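
A side note on what `fit_one_cycle` does with the learning rate we pass in: it is used as the peak of a schedule, not as a constant. The sketch below is my own plain-Python approximation of the one-cycle shape (cosine warm-up to the peak, then cosine decay). The function names are hypothetical, and the defaults `div_factor=25`, `pct_start=0.3` and a final divisor of `div_factor * 1e4` are what I remember from fastai v1, so treat them as assumptions.

```python
import math


def cosine_anneal(start, end, pct):
    """Move from `start` to `end` along half a cosine as pct goes 0 -> 1."""
    return end + (start - end) / 2 * (math.cos(math.pi * pct) + 1)


def one_cycle_lr(step, total_steps, lr_max,
                 div_factor=25.0, pct_start=0.3, final_div=None):
    """Learning rate at `step` (0-based) of a one-cycle schedule.

    Hypothetical re-implementation for illustration; the default values
    are assumptions, not read from the library source.
    """
    if final_div is None:
        final_div = div_factor * 1e4
    warmup_steps = int(total_steps * pct_start)
    if step < warmup_steps:
        # Phase 1: warm up from lr_max / div_factor to lr_max.
        return cosine_anneal(lr_max / div_factor, lr_max, step / warmup_steps)
    # Phase 2: decay from lr_max to lr_max / final_div.
    decay_steps = max(1, total_steps - warmup_steps - 1)
    return cosine_anneal(lr_max, lr_max / final_div,
                         (step - warmup_steps) / decay_steps)


schedule = [one_cycle_lr(s, 100, 1e-2) for s in range(100)]
print(schedule[0], max(schedule), schedule[-1])
```

This is one reason the later plain `fit` calls behave differently from `fit_one_cycle`: `fit` keeps the learning rate fixed for the whole run.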

In [None]:
learn.unfreeze()

In [None]:
%%time
learn.fit(5, best_lm_lr)

# Generating some text

It's always interesting to see whether an LM can generate plausible text. As LM training improves (in terms of loss), at some point you'll notice a clear improvement in the quality of the generated text.

One sample generated with my better-trained LM:

> 'An italian man was found dead in his yard due to heat conditions on Sunday night , his spokeswoman said . The office manager of the Ultra retired man ’s office told buzzfeed News there being no sign of comfort . The man at his 911 home told guy , he had been in contact with his car ’s owner before asleep and then immediately responded to starting fire . The man named Guy made a news video at PARKING Station in which the Mississippi State Police shot multiple people with Tim Shepherd to get their son alive , Mark Morris , a family friend dangling near his wife ’s house , said . The teen told police he was winning inclusion in general . Police dragged him into the house — where the officer had been yards away — during his die - hard bid at a nearby snow salon . The family voted in favor of Appreciative and arrested more than three months later : They tried to detained him . He and his family stopped , per the station , all the way up . “'

Not much sense, but at least some structure :) And now with GPT-2 we see that quantitative improvements can also lead to qualitative improvements.
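
How "structured" a sample looks also depends on how the next word is drawn from the model's output distribution. Here is a minimal sketch of temperature-scaled sampling in plain Python. The function names, the tiny vocabulary and the scores are made up for illustration. I believe `learn.predict` in fastai v1 exposes a `temperature` argument that plays the same role, but check the docs for your version.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities; low temperature sharpens them."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_next_token(vocab, logits, temperature=1.0, rng=random):
    """Draw one token from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(vocab, weights=probs, k=1)[0]


# Hypothetical next-word scores after some prompt.
toy_vocab = ["the", "man", "police", "said"]
toy_logits = [2.0, 1.0, 0.5, 0.1]

sharp = softmax_with_temperature(toy_logits, temperature=0.5)
flat = softmax_with_temperature(toy_logits, temperature=2.0)
print(sharp, flat)
print(sample_next_token(toy_vocab, toy_logits, temperature=0.8))
```

Lower temperatures make the sample more repetitive but more coherent; higher ones make it more diverse and more likely to drift into nonsense.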

In [None]:
learn.predict('An italian man was found dead in his yard due to', n_words=200)

In [None]:
learn.save_encoder('clickbait_news_enc')

# Training classification head

Here again we follow Jeremy Howard. 

**How to do better:** hyperparameter tuning (though it's painfully slow with such a heavy model), more epochs after unfreezing, different learning rates for different layers, etc. It also helps to look at some live examples of ULMFiT training.
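
On "different learning rates for different layers": the ULMFiT paper suggests dividing the learning rate by 2.6 for each layer group you go down from the head. The sketch below just computes such per-group rates in plain Python. Both helpers are hypothetical names of mine, and the number of layer groups (4 here) is an assumption that depends on how the learner splits the model.

```python
def ulmfit_layer_lrs(top_lr, n_groups, factor=2.6):
    """Per-group learning rates, lowest layer group first.

    The last group (the classifier head) gets `top_lr`; each earlier
    group gets the rate of the next one divided by `factor`.
    """
    if n_groups < 1:
        raise ValueError("n_groups must be at least 1")
    return [top_lr / factor ** (n_groups - 1 - i) for i in range(n_groups)]


def geometric_lrs(low, high, n_groups):
    """n_groups rates spaced evenly on a log scale from low to high."""
    if n_groups < 1:
        raise ValueError("n_groups must be at least 1")
    if n_groups == 1:
        return [high]
    ratio = (high / low) ** (1 / (n_groups - 1))
    return [low * ratio ** i for i in range(n_groups)]


layer_lrs = ulmfit_layer_lrs(1e-2, n_groups=4)
print(layer_lrs)
print(geometric_lrs(1e-4, 1e-2, 3))
```

In fastai v1 you normally don't build this list by hand: passing a `slice(low, high)` as the learning rate, for example `slice(lr / 2.6**4, lr)` as in the fast.ai lessons, spreads rates across the layer groups in a similar geometric fashion.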

In [None]:
train_file, test_file = 'train_28k.csv', 'test_5k.csv'

In [None]:
data_clas = TextClasDataBunch.from_csv(path=folder, 
                                        csv_name=train_file,
                                        test=test_file,
                                        vocab=data_lm.train_ds.vocab, 
                                        bs=64,
                                        text_cols='text', 
                                        label_cols='label')

In [None]:
data_clas.save('ulmfit_data_clas_clickbait_news')

In [None]:
learn_clas = text_classifier_learner(data_clas, drop_mult=0.3, arch=AWD_LSTM)
learn_clas.load_encoder('clickbait_news_enc')

In [None]:
learn_clas.lr_find(start_lr=slice(1e-6, 1e-4), end_lr=slice(0.1, 10))

In [None]:
learn_clas.recorder.plot(skip_end=10, suggestion=True)

In [None]:
best_clf_lr = learn_clas.recorder.min_grad_lr
best_clf_lr

In [None]:
learn_clas.fit_one_cycle(1, best_clf_lr)

In [None]:
learn_clas.freeze_to(-2)

In [None]:
learn_clas.fit_one_cycle(1, best_clf_lr)

In [None]:
learn_clas.unfreeze()

In [None]:
learn_clas.fit(5, best_clf_lr)

In [None]:
learn_clas.show_results()

# Predictions for the test set

Thanks to [Noisefield](https://www.kaggle.com/mamamot) for his comments on how to do it efficiently.

In [None]:
data_clas.add_test(test["text"])

In [None]:
test_preds, _ = learn_clas.get_preds(DatasetType.Test, ordered=True)

# Forming a submission file

In [None]:
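# Column names assume fastai orders the classes alphabetically;
# compare with `data_clas.classes` before trusting the mapping.
test_pred_df = pd.DataFrame(test_preds.data.cpu().numpy(),
                            columns=['clickbait', 'news', 'other'])
ulmfit_preds = pd.Series(np.argmax(test_pred_df.values, axis=1),
                        name='label').map({0: 'clickbait', 1: 'news', 2: 'other'})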
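
The index-to-label mapping above is hard-coded, which silently breaks if the class order ever differs. A slightly safer pattern is to take the class list from the data object and index into it. Below is a minimal plain-Python sketch of that mapping; `probs_to_labels` is a hypothetical helper, and in the notebook the `classes` argument would presumably come from `data_clas.classes` (an assumption worth verifying for your fastai version).

```python
def probs_to_labels(prob_rows, classes):
    """Map each row of class probabilities to the most likely class name."""
    labels = []
    for row in prob_rows:
        if len(row) != len(classes):
            raise ValueError("each row needs one probability per class")
        best = max(range(len(row)), key=row.__getitem__)  # argmax of the row
        labels.append(classes[best])
    return labels


# Toy probabilities for two test documents (made-up numbers).
toy_classes = ["clickbait", "news", "other"]
toy_probs = [[0.1, 0.7, 0.2],
             [0.6, 0.3, 0.1]]
toy_labels = probs_to_labels(toy_probs, toy_classes)
print(toy_labels)
```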


In [None]:
ulmfit_preds.head()

In [None]:
ulmfit_preds.to_csv('ulmfit_predictions_advanced.csv', index_label='id', header=True)