### Table of Contents <a class="anchor" id="part0"></a>

* [EDA](#part1)
    * [Loading data](#section_1_1)
    * [EDA](#section_1_2)
* [Transfer Learning](#part2)  
    * [](#section_2_1)
    * [](#section_2_2)

In [60]:
from datasets import load_dataset, load_from_disk
import tensorflow as tf
import pynvml
import matplotlib.pyplot as plt

# EDA<a class="anchor" id="part1"></a>
## Loading Data <a class="anchor" id="section_1_1"></a>

In [None]:
dataset = load_dataset('reddit', download_mode="reuse_cache_if_exists")

Downloading builder script: 4.38kB [00:00, 4.39MB/s]                   
Downloading metadata: 2.83kB [00:00, 2.84MB/s]                   
Using custom data configuration default


Downloading and preparing dataset reddit/default (download: 2.93 GiB, generated: 17.64 GiB, post-processed: Unknown size, total: 20.57 GiB) to C:\Users\VR\.cache\huggingface\datasets\reddit\default\1.0.0\98ba5abea674d3178f7588aa6518a5510dc0c6fa8176d9653a3546d5afcb3969...


Downloading data: 100%|██████████| 3.14G/3.14G [09:05<00:00, 5.76MB/s] 


In [13]:
dataset

DatasetDict({
    train: Dataset({
        features: ['author', 'body', 'normalizedBody', 'subreddit', 'subreddit_id', 'id', 'content', 'summary'],
        num_rows: 3848330
    })
})

In [19]:
# We only need the 'content' and 'summary' columns for training
dataset['train'][4]

{'author': 'NuffZetPand0ra',
 'body': "You are talking about the Charsi imbue, right? Or a cube upgrade?\nIf we are talking Charsi imbue, you can only imbue WHITE items. This includes superior, but they will not neccesarily be superior after imbuing (they get random base-modifications). Bloodfist and Gorefoot are both uniques (gold), and therefore not eligible for imbuing.\nWhen you imbue, the item level matters (the item level is hidden). The item is the same level as the monster who dropped it. That means, that the higher level the monster who dropped it, the more stats is available on that item. It is important to note that an item doesn't neccesarily use all it's stat potential. This means that the same item dropped in a1 and a2 can has the possibility of some very different outcomes.\nAfter the imbue, the item can be as good as if the monster itself had dropped a rare (yellow) item. Imbued weapons will always turn out as rare items.\nTo answer your question, you should just progre

In [20]:
ds = dataset.remove_columns(['author',
  'body',
  'normalizedBody',
  'subreddit',
  'subreddit_id',
  'id'])
ds

DatasetDict({
    train: Dataset({
        features: ['content', 'summary'],
        num_rows: 3848330
    })
})

### We'll shuffle the dataset and keep a subset of it. The full dataset is very large, so we need to balance training quality against the computational resources available. In the "RL4LM" paper, the authors fine-tuned an abstractive summarization model on the CNN/Daily Mail dataset, which contains 311K examples: 300K for training and validation and 11K for testing. We'll keep 400K examples from our dataset for fine-tuning our model.


In [22]:
# Shuffle first, so that any subset we select later is a random sample
shuffled_ds = ds.shuffle(seed=42)

Loading cached shuffled indices for dataset at C:\Users\VR\.cache\huggingface\datasets\reddit\default\1.0.0\98ba5abea674d3178f7588aa6518a5510dc0c6fa8176d9653a3546d5afcb3969\cache-1bd07cb9828bf1ee.arrow


## EDA <a class="anchor" id="section_1_2"></a>

In [94]:
# We want quality samples for training, so we first check the length of the posts.
# Splitting on a single space gives an approximate word count per post.
num_tokens = []
for post in shuffled_ds['train']['content']:
    words = post.split(' ')
    num_tokens.append(len(words))
num_tokens = sorted(num_tokens)

In [95]:
print(num_tokens[0], num_tokens[-1])

1 20964
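A note on the counting method: splitting on a single space treats every gap between spaces as a word, so an empty post still counts as 1 and runs of consecutive spaces inflate the count. The minimum of 1 above may partly reflect this. The sketch below compares the two behaviours on made-up strings; the helper names are hypothetical and not part of the notebook.

```python
def count_split_space(text):
    # Same approach as the notebook: split on a single space character
    return len(text.split(' '))


def count_split_any(text):
    # Alternative: split on any run of whitespace and drop empty pieces
    return len(text.split())


# Made-up strings, chosen only to show where the two counts differ
samples = ["", "hello   world", "one two three"]
for text in samples:
    print(repr(text), count_split_space(text), count_split_any(text))
```

The two counts agree on cleanly spaced text and differ only for empty strings or repeated spaces, so the thresholds used later are best read as approximate word counts.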


In [104]:
# Median post length (num_tokens is already sorted)
print(num_tokens[len(num_tokens) // 2])

194


### As we can see, the majority of posts are under 2000 words long, with a number of long outliers. There are also very short posts that do not contain enough information to create a summary. For our dataset we will cap post length at 350 words to limit the computational load, and we will remove any posts shorter than 20 words, as we don't believe they contain enough material for a summary.
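Before committing to the 20 to 350 word window, it can help to see how much of the data it would retain. The following is a minimal sketch using only the standard library. The function name `length_summary` and the toy lengths are assumptions made for illustration, not values taken from the dataset.

```python
import statistics


def length_summary(lengths, low=20, high=350):
    """Summarize a list of word counts and the share inside [low, high]."""
    ordered = sorted(lengths)
    in_window = sum(low <= n <= high for n in ordered)
    return {
        "min": ordered[0],
        "median": statistics.median(ordered),
        "max": ordered[-1],
        "share_in_window": in_window / len(ordered),
    }


# Hypothetical word counts, not taken from the Reddit dataset
toy_lengths = [1, 15, 40, 120, 300, 351, 2000, 9000]
summary = length_summary(toy_lengths)
print(summary)
```

Applied to the notebook's `num_tokens` list, the same function would report the fraction of posts that survive the length cut, before the summary-length condition is applied.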

In [119]:
# Keep a post only if it is 20 to 350 words long and longer than its summary
def len_content(post):
    content = post['content'].split(' ')
    summary = post['summary'].split(' ')

    # Return an explicit boolean in every case instead of falling through to None
    return 20 <= len(content) <= 350 and len(content) > len(summary)

In [120]:
raw_ds = shuffled_ds['train'].filter(len_content)
raw_ds

100%|██████████| 3849/3849 [02:02<00:00, 31.48ba/s]


Dataset({
    features: ['content', 'summary'],
    num_rows: 2824498
})
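To make the filter's boundary behaviour concrete, here is a small self-contained sketch that applies the same rule to synthetic examples. The names `keep_example` and `make_example` are hypothetical helpers written for this illustration; the thresholds mirror the ones described above.

```python
def keep_example(example, min_words=20, max_words=350):
    # Same rule as the filter above: length window plus "post longer than summary"
    n_content = len(example["content"].split(" "))
    n_summary = len(example["summary"].split(" "))
    return min_words <= n_content <= max_words and n_content > n_summary


def make_example(n_content, n_summary):
    # Build a synthetic record with exactly the requested word counts
    return {
        "content": " ".join(["w"] * n_content),
        "summary": " ".join(["s"] * n_summary),
    }


cases = {
    "too short": make_example(19, 5),
    "lower bound": make_example(20, 5),
    "upper bound": make_example(350, 30),
    "too long": make_example(351, 30),
    "summary as long as post": make_example(25, 25),
}
results = {name: keep_example(example) for name, example in cases.items()}
for name, kept in results.items():
    print(f"{name}: {kept}")
```

Both bounds are inclusive, and a post whose summary is exactly as long as the post is dropped along with those whose summary is longer.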

In [121]:
num_tokens = []
for post in (raw_ds['content']):
    words = post.split(' ')
    num_tokens.append(len(words))
num_tokens = sorted(num_tokens)

In [122]:
print(num_tokens[0], num_tokens[-1])

20 350


In [126]:
# Keep the first 400K examples for fine-tuning. The dataset has already been shuffled and filtered by length.
prepped_ds = raw_ds.select(range(400000))
prepped_ds

Dataset({
    features: ['content', 'summary'],
    num_rows: 400000
})

In [128]:
# Save the prepared subset locally to avoid reloading the full dataset from Hugging Face
prepped_ds.save_to_disk('C:/Users/VR/.cache/huggingface/datasets/prepped')

Loading cached processed dataset at C:\Users\VR\.cache\huggingface\datasets\reddit\default\1.0.0\98ba5abea674d3178f7588aa6518a5510dc0c6fa8176d9653a3546d5afcb3969\cache-05f98df0e5f5a67d.arrow


In [129]:
prepped_ds = load_from_disk('C:/Users/VR/.cache/huggingface/datasets/prepped')
prepped_ds

Dataset({
    features: ['content', 'summary'],
    num_rows: 400000
})

In [130]:
# Hold out 10% for testing; a fixed seed keeps the split reproducible across runs
ds = prepped_ds.train_test_split(test_size=0.1, seed=42)
ds

DatasetDict({
    train: Dataset({
        features: ['content', 'summary'],
        num_rows: 360000
    })
    test: Dataset({
        features: ['content', 'summary'],
        num_rows: 40000
    })
})

In [49]:
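from pynvml import nvmlInit, nvmlDeviceGetHandleByIndex, nvmlDeviceGetMemoryInfo

# Print the memory currently in use on the first GPU (index 0)
def print_gpu_utilization():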
    nvmlInit()
    handle = nvmlDeviceGetHandleByIndex(0)
    info = nvmlDeviceGetMemoryInfo(handle)
    print(f"GPU memory occupied: {info.used//1024**2} MB.")


def print_summary(result):
    print(f"Time: {result.metrics['train_runtime']:.2f}")
    print(f"Samples/second: {result.metrics['train_samples_per_second']:.2f}")
    print_gpu_utilization()

In [50]:
print_gpu_utilization()

GPU memory occupied: 754 MB.
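The helper above assumes that `pynvml` is installed and that an NVIDIA GPU is visible. As a hedged sketch, the variant below returns `None` instead of raising when either assumption fails, which can be convenient when the notebook is opened on a CPU-only machine. The function name `gpu_memory_used_mb` is hypothetical.

```python
def gpu_memory_used_mb(device_index=0):
    """Return used GPU memory in MB (1024**2 bytes), or None if unavailable."""
    try:
        import pynvml
    except ImportError:
        # pynvml is not installed in this environment
        return None
    try:
        pynvml.nvmlInit()
    except pynvml.NVMLError:
        # No NVIDIA driver or NVML library found
        return None
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
        info = pynvml.nvmlDeviceGetMemoryInfo(handle)
        return info.used // 1024**2
    except pynvml.NVMLError:
        # For example, the requested device index does not exist
        return None
    finally:
        # Release NVML once the query is done
        pynvml.nvmlShutdown()


used = gpu_memory_used_mb()
print("GPU memory occupied:", "unavailable" if used is None else f"{used} MB")
```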
